In this project, students will demonstrate their understanding of inference on categorical data. Inference on numerical data is mixed in as well, and students are expected to identify the proper type of inference to apply for each variable type. Unless stated otherwise, students should assume a significance level of 0.05.
The project uses two datasets from the internet:
atheism and nscc_student_data. Store the
atheism dataset in your environment by using the following
R chunk and do some exploratory analysis of the data frame. None of this
will be graded; it is just something for you to do on your own.
# Download atheism dataset from web
download.file("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/atheism.RData", destfile = "atheism.RData")
# Load dataset into environment
load("atheism.RData")
Load the “nscc_student_data.csv” file in the following R chunk and refamiliarize yourself with this dataset as well.
nscc_student_data <- read.csv("nscc_student_data.csv")
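# Helper to get familiar with a dataset passed in as an argument: shows its structure and first rows.
# Note: summary() is evaluated but its result is not displayed, because only the last
# expression of a function is returned and auto-printed (str() prints as a side effect).
funfunction <- function(functiontarget) {
  summary(functiontarget)
  str(functiontarget)
  head(functiontarget)
}
# Call it on each dataset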
funfunction(nscc_student_data)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : chr "July 5" "December 27" "January 31" "6-13" ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : chr "No" "Yes" "Yes" "Yes" ...
## $ VoterReg : chr "Yes" "Yes" "No" "Yes" ...
## Gender PulseRate CoinFlip1 CoinFlip2 Height ShoeLength Age Siblings RandomNum
## 1 Female 64 5 5 62 11.00 19 4 797
## 2 Female 75 4 6 62 11.00 21 3 749
## 3 Female 74 6 1 60 10.00 25 2 13
## 4 Female 65 4 4 62 10.75 19 1 613
## 5 Female NA NA NA 66 NA 26 6 53
## 6 Female 72 6 5 67 9.75 21 1 836
## HoursWorking Credits Birthday ProfsAge Coffee VoterReg
## 1 35 13 July 5 31 No Yes
## 2 25 12 December 27 30 Yes Yes
## 3 30 6 January 31 29 Yes No
## 4 18 9 6-13 31 Yes Yes
## 5 24 15 02-15 32 No Yes
## 6 15 9 april 14 32 No Yes
funfunction(atheism)
## 'data.frame': 88032 obs. of 3 variables:
## $ nationality: Factor w/ 57 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ response : Factor w/ 2 levels "atheist","non-atheist": 2 2 2 2 2 2 2 2 2 2 ...
## $ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## nationality response year
## 1 Afghanistan non-atheist 2012
## 2 Afghanistan non-atheist 2012
## 3 Afghanistan non-atheist 2012
## 4 Afghanistan non-atheist 2012
## 5 Afghanistan non-atheist 2012
## 6 Afghanistan non-atheist 2012
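One detail about the helper above: its output shows the `str()` and `head()` results but no summary table, because a value computed inside a function is only auto-printed when it is the function's return value. A minimal sketch of a variant that prints all three follows. The name `explore_data` is an assumption, and R's built-in `mtcars` data stands in for the project datasets.

```r
# Hypothetical variant of the exploration helper (the name explore_data is an assumption).
explore_data <- function(df) {
  print(summary(df))  # explicit print() so the summary table is shown
  str(df)             # str() prints as a side effect
  print(head(df))     # first six rows
  invisible(df)       # return the data quietly so nothing is printed twice
}

# mtcars ships with R and stands in for the project datasets here
explore_data(mtcars)
```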
In the 2010 playoffs, the National Football League (NFL) changed
their overtime rules amid concerns that whichever team won an overtime
coin toss (by luck) had a significant advantage to win the game.
Nicholas Gorgievski et al. published research in 2010 stating that
out of 414 games won in overtime up to that point, 235 were won by the
team that won the coin toss. Test the claim that the team which wins the
coin flip wins games more often than its opponent.
(Hint: You must recognize what percent of games you’d expect a team to
win if they do not win any more or less than their opponent.)
The hypotheses are \(H_0 : p = 0.5\) and \(H_a : p > 0.5\), where \(p\) is the proportion of overtime games won by the team that won the coin toss. This is a one-tailed test on the right tail: it evaluates whether the coin-toss winner wins more often than the 50% expected by chance alone. Failing to reject the null hypothesis would mean the data do not show a significant advantage from winning the coin toss. Rejecting the null hypothesis would mean that winning the coin toss is associated with a significantly higher chance of winning the game as well.
# Calculate sample proportion of games won by coin flip team
# Sample data: 235 wins out of 414 games for the coin toss winning team
wins_after_tosswin <- 235
total_games <- 414
# Perform the test using prop.test
result <- prop.test(x = wins_after_tosswin, n = total_games, p = 0.5, alternative = "greater", conf.level = 0.95)
# Output results
print(result)
##
## 1-sample proportions test with continuity correction
##
## data: wins_after_tosswin out of total_games, null probability 0.5
## X-squared = 7.3068, df = 1, p-value = 0.003435
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.52606 1.00000
## sample estimates:
## p
## 0.5676329
The calculated p-value is 0.003435, which is lower than the significance level of 0.05, so we have enough evidence to reject the null hypothesis. There is strong evidence that the team winning the coin toss has a higher chance of winning the game itself.
# Data
p0 <- 0.5 # Null hypothesis probability
# SE
se <- sqrt(p0 * (1 - p0) / total_games)
# Test statistic
phat <- wins_after_tosswin / total_games # Sample proportion
z <- (phat - p0) / se
# p-value
p_value <- 1 - pnorm(z)
# Output results
cat("Standard Error:", se, "\nTest Statistic (Z):", z, "\nP-value:", p_value)
## Standard Error: 0.02457366
## Test Statistic (Z): 2.75225
## P-value: 0.002959367
Based on the results of this hypothesis test, with a p-value of 0.003435 (well below the 0.05 level), the decision is to reject the null hypothesis. The hand-calculated z-test gives a slightly smaller p-value (0.002959) because prop.test applies a continuity correction; both lead to the same decision.
There is statistically significant evidence to support the claim that the team winning the coin toss in NFL overtime games has a greater probability of winning the game. This conclusion suggests that the rule change addressing the coin-toss advantage was consistent with the historical record of coin toss and game outcomes.
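As a cross-check on the normal approximation, the same claim can be tested with an exact binomial test. This is a sketch rather than part of the original analysis. It reuses the counts from the question (235 wins in 414 games) and first checks the usual success-failure condition.

```r
wins_after_tosswin <- 235
total_games <- 414
p0 <- 0.5

# Success-failure condition for the normal approximation: both expected counts >= 10
expected_counts <- c(total_games * p0, total_games * (1 - p0))
print(expected_counts)

# Exact one-sided binomial test of H0: p = 0.5 against Ha: p > 0.5
exact_result <- binom.test(wins_after_tosswin, total_games, p = p0, alternative = "greater")
print(exact_result$p.value)
```

The exact p-value should land close to the continuity-corrected value reported by prop.test, which points to the same decision.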
For questions 2 and 3, consider the atheism dataset loaded at the beginning of the project. An atheism index is defined as the percent of a population that identifies as atheist. Is there convincing evidence that Spain has seen a change in its atheism index from 2005 to 2012?
# Create subsets for Spain 2005 and 2012
spain2005 <- subset(atheism, nationality == "Spain" & year == 2005)
spain2012 <- subset(atheism, nationality == "Spain" & year == 2012)
# Create numeric values from strings
spain2005$atheist_numeric <- as.integer(spain2005$response != "non-atheist")
spain2012$atheist_numeric <- as.integer(spain2012$response != "non-atheist")
# Totals 2005
x2005 <- sum(spain2005$atheist_numeric) # Number of atheists in 2005
n2005 <- nrow(spain2005) # Total surveyed in 2005
# Totals 2012
x2012 <- sum(spain2012$atheist_numeric) # Number of atheists in 2012
n2012 <- nrow(spain2012) # Total surveyed in 2012
# Calculate test
test_result <- prop.test(c(x2005, x2012), c(n2005, n2012), alternative = "two.sided", conf.level = 0.95)
# Output
print(test_result)
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(x2005, x2012) out of c(n2005, n2012)
## X-squared = 0.60285, df = 1, p-value = 0.4375
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.01450680 0.03529222
## sample estimates:
## prop 1 prop 2
## 0.10034904 0.08995633
\(H_0 : p_{2012} = p_{2005}\)
\(H_a : p_{2012} \neq p_{2005}\)
This will be a two-tailed test, because we are looking for a difference in either direction (an increase or a decrease) between the two years.
The p-value is 0.4375, much higher than the significance level of 0.05: if the true proportions were equal, a difference at least this large would arise from sampling variation about 44% of the time.
Given that the p-value is so high, we are not able to reject the null hypothesis. These data are not sufficient to conclude that there was a significant difference in the rate of atheism in Spain between 2005 and 2012.
This analysis did not find statistically significant evidence that the atheism index in Spain changed from 2005 to 2012. The observed change in the proportion of individuals identifying as atheists (from about 10.0% to 9.0%) is consistent with random sampling variation rather than a larger trend.
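To make the two-sample test less of a black box, the pooled z-test behind it can be written out by hand. The sketch below uses a hypothetical helper, `two_prop_z`, and made-up counts (not the Spain data). It compares the result with `prop.test` run without the continuity correction, where the chi-squared statistic equals the squared z-score.

```r
# Pooled two-proportion z-test written out by hand (two-sided).
two_prop_z <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)                        # pooled proportion under H0
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # standard error under H0
  z <- (p1 - p2) / se
  p_value <- 2 * pnorm(-abs(z))                          # two-sided p-value
  c(z = z, p_value = p_value)
}

# Hypothetical counts for illustration only (not taken from the atheism data)
manual <- two_prop_z(x1 = 50, n1 = 500, x2 = 45, n2 = 500)
builtin <- prop.test(c(50, 45), c(500, 500), correct = FALSE)
print(manual)
print(builtin$p.value)
```

The test output shown earlier used the default continuity correction, so its statistic and p-value differ slightly from this uncorrected version.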
Is there convincing evidence that the United States has seen a change in its atheism index from 2005 to 2012?
# Create subsets for USA 2005 and 2012
USA2005 <- subset(atheism, nationality == "United States" & year == 2005)
USA2012 <- subset(atheism, nationality == "United States" & year == 2012)
# Create numeric values from strings
USA2005$atheist_numeric <- as.integer(USA2005$response != "non-atheist")
USA2012$atheist_numeric <- as.integer(USA2012$response != "non-atheist")
# Counts
x2005 <- sum(USA2005$atheist_numeric) # Number of atheists in 2005
n2005 <- nrow(USA2005) # Total surveyed in 2005
# Counts
x2012 <- sum(USA2012$atheist_numeric) # Number of atheists in 2012
n2012 <- nrow(USA2012) # Total surveyed in 2012
# Calculate test
test_result <- prop.test(c(x2005, x2012), c(n2005, n2012), alternative = "two.sided", conf.level = 0.95)
# Output
print(test_result)
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(x2005, x2012) out of c(n2005, n2012)
## X-squared = 26.132, df = 1, p-value = 3.188e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.05573843 -0.02410189
## sample estimates:
## prop 1 prop 2
## 0.00998004 0.04990020
\(H_0 : p_{2012} = p_{2005}\)
\(H_a : p_{2012} \neq p_{2005}\)
This will also be a two-tailed test, because we are again looking for a difference in either direction (an increase or a decrease) between the two years, this time in the United States.
The p-value is 3.188e-07, far below the significance level of 0.05. If the true proportions were equal, a difference this large would be extremely unlikely to arise from sampling variation alone.
Given that the p-value is so low, we are able to reject the null hypothesis. These data are sufficient to conclude that there was a significant difference in the rate of atheism in the United States between 2005 and 2012.
This analysis found statistically significant evidence that the atheism index in the USA changed from 2005 to 2012. The sample proportions (about 1.0% in 2005 and 5.0% in 2012) indicate a significant increase in the rate of atheism during this period. In contrast to Spain over the same period, the data show a shift in the proportion of individuals identifying as atheist in the United States, which may reflect growth of the secular movement in American society.
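Questions 2 and 3 repeat the same steps with a different country, so the logic could be wrapped in one function. The sketch below is one possible shape, not the method used above. The helper name `compare_atheism_years` and the toy data frame are assumptions, and the toy values are made up to mirror the three columns of the atheism data frame.

```r
# Hypothetical helper (name and arguments are assumptions): two-proportion test
# of the atheist share in one country between two survey years.
compare_atheism_years <- function(data, country, years = c(2005, 2012)) {
  in_country <- data$nationality == country
  # Number of atheists and number surveyed in each year
  x <- sapply(years, function(y) sum(in_country & data$year == y & data$response == "atheist"))
  n <- sapply(years, function(y) sum(in_country & data$year == y))
  prop.test(x, n, alternative = "two.sided", conf.level = 0.95)
}

# Toy data with the same three columns as the atheism data frame (values are made up)
toy <- data.frame(
  nationality = rep("Examplia", 400),
  year = rep(c(2005, 2012), each = 200),
  response = c(rep(c("atheist", "non-atheist"), times = c(20, 180)),
               rep(c("atheist", "non-atheist"), times = c(40, 160)))
)
toy_result <- compare_atheism_years(toy, "Examplia")
print(toy_result$estimate)
```

With the real data loaded, a call such as `compare_atheism_years(atheism, "Spain")` would avoid reusing variable names like `x2005` across questions.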
Suppose you’re hired by the local government to estimate the proportion of residents in your state that attend a religious service on a weekly basis. According to the guidelines, the government desires a 95% confidence interval with a margin of error no greater than 3%. You have no idea what to expect for \(\hat{p}\). How many people would you have to sample to ensure that you are within the specified margin of error and confidence level?
The formula for this is the minimum sample size formula:
\(n = \left( \frac{Z \cdot \sqrt{p \cdot (1-p)}}{E} \right)^2\)
Where:
\(n\) is the sample size,
\(Z\) is the z-score corresponding to the desired confidence level (e.g., \(Z = 1.96\) for 95% confidence),
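\(p\) is the estimated proportion of the population that exhibits the characteristic of interest (with no prior estimate, \(p = 0.5\) maximizes \(p \cdot (1-p)\) and so gives the most conservative, largest sample size),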
\(E\) is the desired margin of error (e.g., 0.03 or 3%).
# Constants
z_score <- 1.96 # Z-value for 95% confidence
p <- 0.5 # Conservative estimate of the proportion
margin_of_error <- 0.03 # "Good Enough for Government Work" margin of error
# Sample size calculation
n <- (z_score * sqrt(p * (1 - p)) / margin_of_error)^2
# Output the required sample size
n
## [1] 1067.111
Since n = 1067.111 and a sample size must be rounded up, we would need a sample of at least 1068 individuals to meet the government’s standards. The atheism dataset we already have does not capture the information the government needs: the request concerns residents of a single US state and, specifically, the proportion who attend a religious service weekly. Still, it could be used to home in on the proportion of a country’s population who are religious, which could then be supplemented with information on how likely a religious-identifying person is to attend a service weekly in a given region. With 88,032 observations, it is at least large enough to provide useful context for a new survey.
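The calculation above can be packaged so that the rounding step and the critical value are explicit. This is a minimal sketch: the function name `min_sample_size` and its argument names are assumptions, and it uses `qnorm` for the critical value instead of the rounded 1.96.

```r
# Minimum sample size for estimating a proportion, rounded up to a whole person.
# The function name and argument names are illustrative assumptions.
min_sample_size <- function(margin_of_error, conf_level = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # critical value, about 1.96 for 95%
  ceiling(z^2 * p * (1 - p) / margin_of_error^2)
}

n_required <- min_sample_size(margin_of_error = 0.03)
print(n_required)
```

If a prior estimate of the proportion were available, passing it as `p` would give a smaller required sample than the conservative default of 0.5.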
Use the NSCC Student Dataset for Questions 5-7.
Construct a 95% confidence interval of the true proportion of all NSCC students that are registered voters.
# Converting 'VoterReg' from "Yes"/"No" to a binary 1 and 0
nscc_student_data$VoterRegNumeric <- as.integer(nscc_student_data$VoterReg == "Yes")
# Calculate the sample proportion of students who are registered to vote
p_hat <- mean(nscc_student_data$VoterRegNumeric, na.rm = TRUE) # na.rm = TRUE to ignore NA values, if any
# Number of students surveyed (excluding NAs in VoterReg)
n <- sum(!is.na(nscc_student_data$VoterRegNumeric))
# Standard error calculation
SE <- sqrt(p_hat * (1 - p_hat) / n)
# Z-score for a 95% confidence interval
Z <- 1.96
# Margin of error calculation
MoE <- Z * SE
# Confidence interval calculation
lower_bound <- p_hat - MoE
upper_bound <- p_hat + MoE
(lower_bound)
## [1] 0.6455899
(upper_bound)
## [1] 0.9044101
When estimating the proportion of NSCC students who are registered voters, we are 95% confident that the true proportion falls between 64.6% and 90.4% (interval: 0.6455899 to 0.9044101). This range represents our best estimate, based on the sample data, of the percentage of students at NSCC who are registered to vote.
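The interval steps above can be collected into a small reusable function. In this sketch, `wald_ci` is a hypothetical helper name, and the counts of 31 "Yes" responses out of 40 are an assumption chosen because they are consistent with the interval reported above.

```r
# Wald (normal-approximation) confidence interval for a proportion.
# wald_ci is a hypothetical helper name; x = number of "Yes", n = number of responses.
wald_ci <- function(x, n, conf_level = 0.95) {
  p_hat <- x / n
  z <- qnorm(1 - (1 - conf_level) / 2)       # critical value, about 1.96 for 95%
  moe <- z * sqrt(p_hat * (1 - p_hat) / n)   # margin of error
  c(lower = p_hat - moe, upper = p_hat + moe)
}

# Assumed counts: 31 "Yes" out of 40 (consistent with the interval reported above)
voter_ci <- wald_ci(x = 31, n = 40)
print(round(voter_ci, 3))
```

For a sample this small, `prop.test` reports a somewhat different interval because it uses a score interval with a continuity correction instead of the Wald formula.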
Construct a 95% confidence interval of the average height of all NSCC students.
# I keep thinking I should just write functions to do this stuff instead of copying and pasting it then editing, but whatever, shouldn't be lazy
# Calculate the sample mean
mean_height <- mean(nscc_student_data$Height, na.rm = TRUE) # also removing NAs
# Calculate the standard deviation
std_dev <- sd(nscc_student_data$Height, na.rm = TRUE)
# Count non-NA heights (but didn't get rid of that one kid who claimed to be six inches tall)
n <- sum(!is.na(nscc_student_data$Height))
# Calculate Standard Error of the Mean
SEoM <- std_dev / sqrt(n)
# Z-score for 95% confidence (a t critical value, qt(0.975, n - 1), would give a slightly wider interval)
Z <- 1.96
# Calculate Margin of Error
MoE <- Z * SEoM
# Calculate the 95% confidence interval
lower_bound <- mean_height - MoE
upper_bound <- mean_height + MoE
When estimating the average height of NSCC students, we are 95% confident that the true average height lies between 61.2 and 67.9 inches (interval 61.19633 to 67.85239). This confidence interval reflects the range within which the average height of all students at NSCC is likely to fall, given the sample data we analyzed.
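Because the population standard deviation is unknown and is estimated from the sample, a t-based interval is the more common choice for a mean. The sketch below shows that variant on hypothetical heights (not the NSCC data) and checks it against `t.test`. The interval reported above uses 1.96 and would be slightly narrower than its t-based counterpart.

```r
# Hypothetical heights in inches, for illustration only (not the NSCC data)
heights <- c(62, 65.5, 60, 71, 66, 67, 63.5, 69, 64, 68, NA)

n <- sum(!is.na(heights))
mean_height <- mean(heights, na.rm = TRUE)
se_mean <- sd(heights, na.rm = TRUE) / sqrt(n)

# t critical value with n - 1 degrees of freedom replaces the fixed 1.96
t_crit <- qt(0.975, df = n - 1)
manual_ci <- c(mean_height - t_crit * se_mean, mean_height + t_crit * se_mean)

# t.test() drops NAs itself and reports the same interval
builtin_ci <- t.test(heights, conf.level = 0.95)$conf.int
print(manual_ci)
print(as.numeric(builtin_ci))
```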
Starbucks is considering opening a coffee shop on NSCC Danvers campus if they believe that more NSCC students drink coffee than the national proportion. A Gallup poll in 2015 found that 64% of all Americans drink coffee. Conduct a hypothesis test to determine if more NSCC students drink coffee than other Americans.
# Trying this thing where I keep all the code itself in one chunk at the top. I've been working with Python a lot for work and I think the R-Markdown method of splicing several languages into one document with HTML makes it a bit too easy to be disorganized and ruin the "flow".
#
# Sally forth:
#
# Convert 'Coffee' from strings "Yes"/"No" to binary 1/0
nscc_student_data$CoffeeNumeric <- as.integer(nscc_student_data$Coffee == "Yes")
# Calculate the sample proportion of coffee drinkers
n <- sum(!is.na(nscc_student_data$CoffeeNumeric)) # number of students with a recorded answer (NAs excluded, matching x below)
x <- sum(nscc_student_data$CoffeeNumeric, na.rm = TRUE) # count of 'Yes' responses, handling NA values if necessary
p_hat <- x / n # sample proportion
# Null hypothesis
p_null <- 0.64 # National proportion of coffee drinkers
SE <- sqrt(p_null * (1 - p_null) / n) # Standard Error
# Calculate the Z-score
Z <- (p_hat - p_null) / SE # Z-score: how many standard errors the sample proportion lies above the null value
# Calculate the p-value for a right-tailed test
p_value <- 1 - pnorm(Z)
# Output the results
cat("Sample Size:", n, "\nNumber of Coffee Drinkers:", x, "\nSample Proportion:", p_hat, "\nZ-Score:", Z, "\nP-Value:", p_value, "\n")
## Sample Size: 40
## Number of Coffee Drinkers: 30
## Sample Proportion: 0.75
## Z-Score: 1.449377
## P-Value: 0.07361613
a.) Write hypotheses and determine tails of the test
\(H_0 : p_{NSCC} \leq 0.64\)
\(H_a : p_{NSCC} > 0.64\)
We use a one-tailed test, focusing on the right tail, because we are interested in knowing whether the proportion of NSCC coffee drinkers is greater than the national average from the Gallup polling data.
b.) Calculate sample statistics
cat("\n The calculated sample size: ", n)
##
## The calculated sample size: 40
cat("\n The summed number of coffee drinkers: ", x)
##
## The summed number of coffee drinkers: 30
The sample size is 40 and the number of coffee drinkers is 30, so the sample proportion is \(\hat{p} = 30 / 40 = 0.75\).
c.) Determine probability of getting sample data by chance
cat("\n Z-score: ", Z)
##
## Z-score: 1.449377
cat("\n p-value: ", p_value)
##
## p-value: 0.07361613
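With a Z-score of 1.449377, the p-value is 0.07361613: if the true proportion were 0.64, a sample proportion of 0.75 or higher would occur about 7.4% of the time.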
d.) Decision
# Tidy little if-else statement for the hypothesis testing :)
alpha <- 0.05
if (p_value < alpha) {
cat("Reject the null hypothesis: There is significant evidence to conclude that more NSCC students drink coffee compared to the national average.")
} else {
cat("Fail to reject the null hypothesis: There is not sufficient evidence to conclude that more NSCC students drink coffee compared to the national average.")
}
## Fail to reject the null hypothesis: There is not sufficient evidence to conclude that more NSCC students drink coffee compared to the national average.
We have failed to reject the null hypothesis. There is insufficient evidence to conclude that more NSCC students drink coffee compared to the national average.
e.) Conclusion
Given the evidence (a p-value of 0.0736, above the 0.05 level), we failed to reject the null hypothesis. There is not sufficient evidence to conclude that more NSCC students drink coffee than the national proportion of 64% reported in the Gallup poll. Further data is advised, for example on whether the presence of a corporate coffee kiosk on campus would increase the likelihood that students begin to consume coffee.
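The hand-calculated test above can be cross-checked with built-in functions. This sketch reuses the counts from the printed output (30 coffee drinkers out of 40) and is an addition, not part of the original analysis. It runs `prop.test` without the continuity correction, which corresponds to the z-test used above, and an exact binomial test, which does not rely on the normal approximation.

```r
x <- 30          # coffee drinkers, from the output above
n <- 40          # students surveyed, from the output above
p_null <- 0.64   # national proportion from the Gallup poll

# Success-failure condition under H0: both expected counts should be at least 10
print(c(n * p_null, n * (1 - p_null)))

# Without the continuity correction, prop.test corresponds to the hand-calculated z-test
uncorrected <- prop.test(x, n, p = p_null, alternative = "greater", correct = FALSE)
# Exact binomial test, which does not rely on the normal approximation
exact <- binom.test(x, n, p = p_null, alternative = "greater")

print(uncorrected$p.value)
print(exact$p.value)
```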
Declaration of Conflict of Interest
The author wishes to declare a conflict of interest and make it known that as of the publishing of this R-markdown he is celebrating two years without drinking a single cup of coffee.