In this project, students will demonstrate their understanding of inference on categorical data. Inference on numerical data is mixed in as well, and students are expected to identify the proper type of inference to apply based on the variable type. Unless stated otherwise, students should assume a significance level of 0.05.
The project will use two datasets from the internet:
atheism and nscc_student_data. Store the
atheism dataset in your environment by using the following
R chunk and do some exploratory analysis of the dataframe. None of this
will be graded; it is just something for you to do on your own.
# Download atheism dataset from web
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData", destfile = "atheism.RData")
# Load dataset into environment
load("atheism.RData")
Load the “nscc_student_data.csv” file in the R chunk below and refamiliarize yourself with this dataset as well.
#Load the dataset into R and attempt to better understand the information
nscc <- read.csv("nscc_student_data.csv")
str(nscc)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : chr "July 5" "December 27" "January 31" "6-13" ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : chr "No" "Yes" "Yes" "Yes" ...
## $ VoterReg : chr "Yes" "Yes" "No" "Yes" ...
In the 2010 playoffs, the National Football League (NFL) changed
their overtime rules amid concerns that whichever team won an overtime
coin toss (by luck) had a significant advantage to win the game.
Nicholas Gorgievski et al. published research in 2010 stating that
out of 414 games won in overtime up to that point, 235 were won by the
team that won the coin toss. Test the claim that the team which wins the
coin flip wins games more often than its opponent.
(Hint: You must recognize what percent of games you’d expect a team to
win if they do not win any more or less than their opponent.)
Write hypotheses and determine tails of the test
\[H_0: p = .5\] \[H_A: p > .5\]
Find p-value of sample data occurring by chance
# Store games won by the team that won the coin toss (x) and total games won in OT (n)
x <- 235
n <- 414
# Calculate sample proportion of games won by the coin-toss winner
p.hat <- x/n
# SE under the null hypothesis (p = .5)
se <- sqrt(.5*(1-.5)/n)
# Test statistic
zp <- abs(p.hat-.5)/se
# p-value (doubling the upper tail makes this a two-sided p-value)
pnorm(zp, lower.tail = FALSE)*2
## [1] 0.005918735
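Decision based on a significance level of .05: the output above is a two-sided p-value; for the one-sided alternative \(p > .5\), the p-value is half of it, about .003. Either way p < .05, so we reject the null hypothesis.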
State conclusion
The data suggests that the team that wins the coin flip wins games more
often than its opponent.
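As a cross-check, the same one-proportion test can be run with R's built-in prop.test. This is a minimal sketch rather than part of the original solution; it assumes the one-sided alternative \(p > .5\) from the hypotheses above, and the name ot.test is only an illustrative variable.

```r
# Games won by the coin-toss winner and total overtime games (from the problem)
x <- 235
n <- 414

# One-sample proportion test against p = .5, upper tail only;
# correct = FALSE turns off the continuity correction so the result
# lines up with the hand-computed z-test
ot.test <- prop.test(x = x, n = n, p = 0.5,
                     alternative = "greater", correct = FALSE)

# One-sided p-value
ot.test$p.value

# The reported X-squared statistic is the square of the z statistic
sqrt(unname(ot.test$statistic))
```

The one-sided p-value printed here should be half of the two-sided value computed by hand.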
For questions 2 and 3, consider the atheism dataset loaded at the beginning of the project. An atheism index is defined as the percent of a population that identifies as atheist. Is there convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012?
# Create subsets for Spain 2005 and 2012
spain2005 <- subset(atheism, nationality == "Spain" & year == 2005)
spain2012 <- subset(atheism, nationality == "Spain" & year == 2012)
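The counts needed for the test can be pulled from a subset directly instead of being read off a large table and retyped. The sketch below assumes each subset has a `response` column with the levels "atheist" and "non-atheist" (as the table output suggests); count.atheists is a hypothetical helper, and spain2012.demo is a stand-in rebuilt from the Spain 2012 counts that table(atheism) prints, so the example runs without the download.

```r
# Stand-in for the spain2012 subset, rebuilt from the 2012 counts printed by
# table(atheism) (103 atheist, 1042 non-atheist). Assumption: the real subset
# has a `response` column with these two levels.
spain2012.demo <- data.frame(
  response = factor(rep(c("atheist", "non-atheist"), times = c(103, 1042)))
)

# Hypothetical helper: number of atheist responses (x) and total responses (n)
count.atheists <- function(df) {
  c(x = sum(df$response == "atheist"), n = nrow(df))
}

counts <- count.atheists(spain2012.demo)
counts
```

With the real data, count.atheists(spain2005) and count.atheists(spain2012) would supply x and n for prop.test without copying values by hand.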
Write hypotheses and determine tails of the test \[H_0: p_1 = p_2\] \[H_A: p_1 \neq p_2\]
Find p-value of sample data occurring by chance
#Store the number of people who identify as atheist in Spain in 2005 and 2012, and the total number of responses
table(atheism)
## , , year = 2005
##
## response
## nationality atheist non-atheist
## Afghanistan 0 0
## Argentina 20 982
## Armenia 0 0
## Australia 0 0
## Austria 100 903
## Azerbaijan 0 0
## Belgium 0 0
## Bosnia and Herzegovina 90 910
## Brazil 0 0
## Bulgaria 50 941
## Cameroon 25 479
## Canada 60 943
## China 0 0
## Colombia 18 588
## Czech Republic 200 800
## Ecuador 4 396
## Fiji 0 0
## Finland 69 915
## France 234 1437
## Georgia 0 0
## Germany 50 452
## Ghana 0 1505
## Hong Kong 0 0
## Iceland 51 801
## India 44 1047
## Iraq 0 0
## Ireland 0 0
## Italy 59 928
## Japan 276 924
## Kenya 0 1000
## Korea, Rep (South) 168 1356
## Lebanon 0 0
## Lithuania 20 1005
## Macedonia 36 1173
## Malaysia 21 499
## Moldova 22 1063
## Netherlands 35 470
## Nigeria 10 1039
## Pakistan 27 2678
## Palestinian territories (West Bank and Gaza) 0 0
## Peru 24 1183
## Poland 10 510
## Romania 10 1040
## Russian Federation 40 960
## Saudi Arabia 0 0
## Serbia 41 996
## South Africa 2 198
## South Sudan 0 0
## Spain 115 1031
## Sweden 0 0
## Switzerland 35 472
## Tunisia 0 0
## Turkey 0 0
## Ukraine 41 972
## United States 10 992
## Uzbekistan 0 0
## Vietnam 5 495
##
## , , year = 2012
##
## response
## nationality atheist non-atheist
## Afghanistan 0 1031
## Argentina 70 921
## Armenia 10 485
## Australia 104 935
## Austria 100 902
## Azerbaijan 0 509
## Belgium 42 485
## Bosnia and Herzegovina 40 960
## Brazil 20 1982
## Bulgaria 19 987
## Cameroon 15 489
## Canada 90 912
## China 235 265
## Colombia 18 588
## Czech Republic 300 700
## Ecuador 8 396
## Fiji 10 1008
## Finland 59 926
## France 485 1203
## Georgia 10 990
## Germany 75 427
## Ghana 0 1490
## Hong Kong 45 455
## Iceland 85 767
## India 33 1059
## Iraq 0 1000
## Ireland 100 910
## Italy 79 908
## Japan 372 840
## Kenya 20 980
## Korea, Rep (South) 229 1294
## Lebanon 10 495
## Lithuania 10 1005
## Macedonia 12 1197
## Malaysia 0 520
## Moldova 54 1031
## Netherlands 71 438
## Nigeria 10 1039
## Pakistan 54 2650
## Palestinian territories (West Bank and Gaza) 25 602
## Peru 36 1171
## Poland 26 499
## Romania 10 1029
## Russian Federation 60 940
## Saudi Arabia 25 475
## Serbia 31 1005
## South Africa 8 194
## South Sudan 61 959
## Spain 103 1042
## Sweden 40 455
## Switzerland 46 467
## Tunisia 0 498
## Turkey 21 1011
## Ukraine 30 983
## United States 50 952
## Uzbekistan 10 490
## Vietnam 0 500
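# Note: the table above lists 115 atheist responses for Spain in 2005;
# the value 113 entered here is what produced the output below
x1 <- 113
x2 <- 103
n1 <- 1031 + 113
n2 <- 1042 + 103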
#Calculate p-value and make decision based on a significance level of .05
x.list <- c(x1, x2)
n.list <- c(n1, n2)
prop.test(x = x.list, n = n.list, alternative = "two.sided", correct = FALSE)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: x.list out of n.list
## X-squared = 0.5209, df = 1, p-value = 0.4705
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.01512950 0.03276928
## sample estimates:
## prop 1 prop 2
## 0.09877622 0.08995633
Decision
Based on a p-value > .05, we fail to reject the null
hypothesis.
State conclusion
There is not enough evidence to suggest that Spain has seen a change in its
atheism index from 2005 to 2012.
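To see what prop.test is doing for two samples, the pooled two-proportion z-test can be worked by hand. This is a sketch under the same counts entered above; the names p.pool, se.pool, z and p.value are illustrative. The squared z statistic should agree with the X-squared value in the output.

```r
# Counts used in the test above
x1 <- 113; n1 <- 1031 + 113   # Spain 2005
x2 <- 103; n2 <- 1042 + 103   # Spain 2012

p1 <- x1 / n1
p2 <- x2 / n2

# Pooled proportion under H0: p1 = p2
p.pool <- (x1 + x2) / (n1 + n2)

# Standard error of the difference under H0
se.pool <- sqrt(p.pool * (1 - p.pool) * (1 / n1 + 1 / n2))

# z statistic and two-sided p-value
z <- (p1 - p2) / se.pool
p.value <- 2 * pnorm(abs(z), lower.tail = FALSE)

c(z = z, z.squared = z^2, p.value = p.value)
```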
Is there convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012?
# Create subsets for USA 2005 and 2012
USA2005 <- subset(atheism, nationality == "United States" & year == 2005)
USA2012 <- subset(atheism, nationality == "United States" & year == 2012)
Write hypotheses and determine tails of the test \[H_0: p_1 = p_2\] \[H_A: p_1 \neq p_2\]
Find p-value of sample data occurring by chance
#Store the number of people who identify as atheist in the United States in 2005 and 2012, along with the total number of responses
table(USA2005)
## , , year = 2005
##
## response
## nationality atheist non-atheist
## Afghanistan 0 0
## Argentina 0 0
## Armenia 0 0
## Australia 0 0
## Austria 0 0
## Azerbaijan 0 0
## Belgium 0 0
## Bosnia and Herzegovina 0 0
## Brazil 0 0
## Bulgaria 0 0
## Cameroon 0 0
## Canada 0 0
## China 0 0
## Colombia 0 0
## Czech Republic 0 0
## Ecuador 0 0
## Fiji 0 0
## Finland 0 0
## France 0 0
## Georgia 0 0
## Germany 0 0
## Ghana 0 0
## Hong Kong 0 0
## Iceland 0 0
## India 0 0
## Iraq 0 0
## Ireland 0 0
## Italy 0 0
## Japan 0 0
## Kenya 0 0
## Korea, Rep (South) 0 0
## Lebanon 0 0
## Lithuania 0 0
## Macedonia 0 0
## Malaysia 0 0
## Moldova 0 0
## Netherlands 0 0
## Nigeria 0 0
## Pakistan 0 0
## Palestinian territories (West Bank and Gaza) 0 0
## Peru 0 0
## Poland 0 0
## Romania 0 0
## Russian Federation 0 0
## Saudi Arabia 0 0
## Serbia 0 0
## South Africa 0 0
## South Sudan 0 0
## Spain 0 0
## Sweden 0 0
## Switzerland 0 0
## Tunisia 0 0
## Turkey 0 0
## Ukraine 0 0
## United States 10 992
## Uzbekistan 0 0
## Vietnam 0 0
table(USA2012)
## , , year = 2012
##
## response
## nationality atheist non-atheist
## Afghanistan 0 0
## Argentina 0 0
## Armenia 0 0
## Australia 0 0
## Austria 0 0
## Azerbaijan 0 0
## Belgium 0 0
## Bosnia and Herzegovina 0 0
## Brazil 0 0
## Bulgaria 0 0
## Cameroon 0 0
## Canada 0 0
## China 0 0
## Colombia 0 0
## Czech Republic 0 0
## Ecuador 0 0
## Fiji 0 0
## Finland 0 0
## France 0 0
## Georgia 0 0
## Germany 0 0
## Ghana 0 0
## Hong Kong 0 0
## Iceland 0 0
## India 0 0
## Iraq 0 0
## Ireland 0 0
## Italy 0 0
## Japan 0 0
## Kenya 0 0
## Korea, Rep (South) 0 0
## Lebanon 0 0
## Lithuania 0 0
## Macedonia 0 0
## Malaysia 0 0
## Moldova 0 0
## Netherlands 0 0
## Nigeria 0 0
## Pakistan 0 0
## Palestinian territories (West Bank and Gaza) 0 0
## Peru 0 0
## Poland 0 0
## Romania 0 0
## Russian Federation 0 0
## Saudi Arabia 0 0
## Serbia 0 0
## South Africa 0 0
## South Sudan 0 0
## Spain 0 0
## Sweden 0 0
## Switzerland 0 0
## Tunisia 0 0
## Turkey 0 0
## Ukraine 0 0
## United States 50 952
## Uzbekistan 0 0
## Vietnam 0 0
x3 <- 10
x4 <- 50
n3 <- 1002
n4 <- 1002
#Calculate p-value and make decision based on a significance level of .05
x.list2 <- c(x3, x4)
n.list2 <- c(n3, n4)
prop.test(x = x.list2, n = n.list2, alternative = "two.sided", correct = FALSE)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: x.list2 out of n.list2
## X-squared = 27.49, df = 1, p-value = 1.579e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.05474042 -0.02509990
## sample estimates:
## prop 1 prop 2
## 0.00998004 0.04990020
Decision
Based on a p-value of 1.579e-07, which is far below .05, we reject the null
hypothesis.
State conclusion
The data suggests that the United States has seen a change in its
atheism index from 2005 to 2012.
Suppose you’re hired by the local government to estimate the proportion of residents in your state that attend a religious service on a weekly basis. According to the guidelines, the government desires a 95% confidence interval with a margin of error no greater than 3%. You have no idea what to expect for \(\hat{p}\). How many people would you have to sample to ensure that you are within the specified margin of error and confidence level?
#Calculate minimum sample size for a 95% CI with a margin of error no greater than 3%
1.96^2*(.5)*(.5)/(.03^2)
## [1] 1067.111
For a margin of error of 3 percentage points, round up to n = 1068.
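The calculation above can be wrapped in a small function so the rounding step is explicit. This is a sketch: min.sample.size is a hypothetical helper, it uses qnorm for the critical value instead of the rounded 1.96, and it defaults to p = 0.5 as the most conservative planning value when nothing is known about \(\hat{p}\).

```r
# Hypothetical helper: smallest n for a given margin of error and confidence level.
# p = 0.5 is the most conservative planning value when p-hat is unknown.
min.sample.size <- function(me, conf = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)     # critical value, about 1.96 for 95%
  ceiling(z^2 * p * (1 - p) / me^2)  # always round up to a whole person
}

n.needed <- min.sample.size(me = 0.03)
n.needed
```

Rounding up rather than to the nearest integer matters here because any smaller sample would exceed the allowed margin of error.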
Use the NSCC Student Dataset for Questions 5-7.
Construct a 95% confidence interval of the true proportion of all NSCC
students that are registered voters.
#Find sample proportion of NSCC students that are registered voters
table(nscc$VoterReg == "Yes")
##
## FALSE TRUE
## 9 31
x5 <- 31
n5 <- 40
p.hat2 <- x5/n5
#Calculate SE
se2 <- sqrt(p.hat2*(1-p.hat2)/n5)
#Lower bound of CI
p.hat2 - 1.96*se2
## [1] 0.6455899
#Upper bound of CI
p.hat2 + 1.96*se2
## [1] 0.9044101
The 95% confidence interval for the true proportion of NSCC students that are registered voters is between 64.56% and 90.44%.
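The interval above follows the normal-approximation (Wald) formula, which can be sketched as a reusable function. wald.ci is a hypothetical helper, and it uses qnorm for the critical value, so the bounds may differ from the hand calculation in the later decimal places.

```r
# Hypothetical helper: Wald confidence interval for a single proportion
wald.ci <- function(x, n, conf = 0.95) {
  p.hat <- x / n
  z <- qnorm(1 - (1 - conf) / 2)        # critical value, about 1.96 for 95%
  se <- sqrt(p.hat * (1 - p.hat) / n)   # standard error based on p-hat
  c(lower = p.hat - z * se, upper = p.hat + z * se)
}

# 31 registered voters out of 40 students, as in the table above
voter.ci <- wald.ci(x = 31, n = 40)
voter.ci
```

Note that prop.test reports a Wilson score interval for a single proportion, so its bounds would differ somewhat from these.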
Construct a 95% confidence interval of the average height of all NSCC students.
#Find the sample mean and SD of height for NSCC students, and the sample size excluding NA values
mean.height <- mean(nscc$Height, na.rm = TRUE)
sd.height <- sd(nscc$Height, na.rm = TRUE)
n6 <- sum(!is.na(nscc$Height))
#Calculate t-critical value for 95% confidence
t.crit <- abs(qt(p = .025, df = n6 - 1))
#Calculate margin of error
me <- t.crit * sd.height/sqrt(n6)
#Boundaries of confidence interval
mean.height - me
## [1] 61.08699
mean.height + me
## [1] 67.96173
We can be 95% confident that the average height of all NSCC students is between 61.09 and 67.96.
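The manual steps for a one-sample t interval can be checked against R's built-in t.test. The sketch below uses a short made-up vector of heights (illustrative values only, not the NSCC data) so that it runs on its own; the variable names are assumptions for the example.

```r
# Illustrative heights only (not the NSCC data); the NA mimics a missing response
heights <- c(62, 64.5, 60, 66, NA, 70, 63, 68.5, 65, 61)

# Manual interval, following the same steps as above
n <- sum(!is.na(heights))
xbar <- mean(heights, na.rm = TRUE)
s <- sd(heights, na.rm = TRUE)
t.crit <- qt(0.975, df = n - 1)
manual.ci <- c(xbar - t.crit * s / sqrt(n), xbar + t.crit * s / sqrt(n))

# Built-in equivalent; t.test drops NA values itself
builtin.ci <- as.numeric(t.test(heights, conf.level = 0.95)$conf.int)

manual.ci
builtin.ci
```

On the real data, t.test(nscc$Height)$conf.int should reproduce the interval reported above.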
Starbucks is considering opening a coffee shop on NSCC Danvers campus if they believe that more NSCC students drink coffee than the national proportion. A Gallup poll in 2015 found that 64% of all Americans drink coffee. Conduct a hypothesis test to determine if more NSCC students drink coffee than other Americans.
a.) Write hypotheses and determine tails of the test
\[H_0: p = .64\] \[H_A: p > .64\]
b.) Calculate sample statistics
#Find sample proportion of NSCC students that drink coffee
table(nscc$Coffee == "Yes")
##
## FALSE TRUE
## 10 30
x7 <- 30
n7 <- 40
c.) Determine probability of getting sample data by chance
#Calculate p-value and evaluate based on a significance level of .05
prop.test(x = x7, n = n7, p = .64, alternative = "greater", correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: x7 out of n7, null probability 0.64
## X-squared = 2.1007, df = 1, p-value = 0.07362
## alternative hypothesis: true p is greater than 0.64
## 95 percent confidence interval:
## 0.6240271 1.0000000
## sample estimates:
## p
## 0.75
d.) Decision
Based on a p-value > .05, we fail to reject the null hypothesis.
e.) Conclusion
There is not enough evidence to suggest that a greater proportion of NSCC
students drink coffee than the national proportion of 64%.
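The prop.test result above can be reproduced by hand with a one-proportion z-test, which makes the link between the z statistic and the reported X-squared value visible. This is a sketch; the names p0, p.hat7, se0, z7 and p.value7 are illustrative.

```r
# Coffee drinkers and sample size from the table above
x7 <- 30
n7 <- 40
p0 <- 0.64                        # national proportion under H0

p.hat7 <- x7 / n7

# Standard error uses the null proportion, not p-hat
se0 <- sqrt(p0 * (1 - p0) / n7)

# z statistic and upper-tail p-value for H_A: p > .64
z7 <- (p.hat7 - p0) / se0
p.value7 <- pnorm(z7, lower.tail = FALSE)

c(z = z7, z.squared = z7^2, p.value = p.value7)
```

The squared z statistic matches the X-squared value in the prop.test output, and the upper-tail area matches its p-value.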