In this project, students will demonstrate their understanding of the normal distribution, sampling distributions, confidence intervals and hypothesis tests to determine if NSCC students differ from typical averages or if any differences are just due to random variation.
Tasks:
Load and store the sample NSCC Student Dataset using the read.csv() function.
# Store the NSCC student dataset in environment
nscc <- read.csv("nscc_student_data.csv")
Find the sample mean and sample size of the PulseRate variable in this dataset and answer the question that follows below.
# Mean pulse rate of this sample
mean(nscc$PulseRate, na.rm = TRUE)
## [1] 73.47368
# Find the sample size of pulse rates (hint: it's the number of non-NA values)
sum(!is.na(nscc$PulseRate))
## [1] 38
Questions:
Do you expect the sample mean to equal the true population mean? Why? No. The sample mean is only an estimate of the population mean, and because of random sampling it will almost never equal the true value exactly.
If we took a different sample, would we get the same results? No. A different sample would include different students, so it would almost certainly produce a different sample mean and therefore different results.
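The sample-to-sample variation described above can be illustrated with a short simulation. This is only a sketch: the population is hypothetical (normal with mean 72 and standard deviation 14, generated with `rnorm()`), not the NSCC data, and the seed and the number of repeated samples are arbitrary choices.

```r
# Hypothetical population: normal, mean 72 bpm, sd 14 bpm (assumed for illustration)
set.seed(1)
n <- 38      # same size as the non-missing PulseRate sample
reps <- 5    # number of repeated samples to draw (arbitrary)

# Draw 'reps' independent samples and record each sample mean
sample_means <- replicate(reps, mean(rnorm(n, mean = 72, sd = 14)))
round(sample_means, 2)

# Theoretical standard error of the mean for samples of this size
14 / sqrt(n)
```

Each simulated sample gives a different mean, and the standard error indicates how far a typical sample mean falls from the population mean.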
Task: Construct 90%, 95%, and 99% Confidence Intervals for the mean pulse rate of all NSCC students. Assume that σ = 14.
# Store mean
mn <- mean(nscc$PulseRate, na.rm = TRUE)
mn
## [1] 73.47368
# Store sample size
ss <- sum(!is.na(nscc$PulseRate))
ss
## [1] 38
# Calculate lower bound of 90% CI
mn - 1.645*(14/sqrt(ss))
## [1] 69.73772
# Calculate upper bound of 90% CI
mn + 1.645*(14/sqrt(ss))
## [1] 77.20964
# Calculate lower bound of 95% CI
mn - 1.96*(14/sqrt(ss))
## [1] 69.02233
# Calculate upper bound of 95% CI
mn + 1.96*(14/sqrt(ss))
## [1] 77.92504
# Calculate lower bound of 99% CI
mn - 2.58*(14/sqrt(ss))
## [1] 67.61425
# Calculate upper bound of 99% CI
mn + 2.58*(14/sqrt(ss))
## [1] 79.33312
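The six calculations above follow one pattern, so they can be wrapped in a small function. The sketch below is one possible way to do that, not part of the assignment: `z_ci` is a hypothetical helper name, the summary values are copied from the output above, and the critical values come from `qnorm()` instead of being typed in.

```r
# Summary values reported above
mn <- 73.47368   # sample mean pulse rate
ss <- 38         # number of non-NA pulse rates
sigma <- 14      # population sd, assumed known in the task

# Hypothetical helper: z-based confidence interval for a mean
z_ci <- function(xbar, sigma, n, conf) {
  z_star <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  margin <- z_star * sigma / sqrt(n)    # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# One row per confidence level
ci_table <- t(sapply(c(0.90, 0.95, 0.99), function(cl) z_ci(mn, sigma, ss, cl)))
rownames(ci_table) <- c("90%", "95%", "99%")
round(ci_table, 2)
```

Because `qnorm(0.995)` is about 2.576 rather than the rounded 2.58, the 99% bounds from this sketch differ from the values above in the second decimal place.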
Questions:
Interpret your 95% CI in plain language. I am 95% confident that the true mean pulse rate for all NSCC students is between 69.02 and 77.93 bpm.
How does the interval change as confidence increases? As the confidence level increases, the interval becomes wider, increasing the likelihood that the interval includes the true population mean.
Which interval would you report and why? I would report the 95% interval because it balances confidence and precision: it is narrower than the 99% interval but gives more confidence than the 90% interval.
Consider the national average pulse rate for US adults to be 72 bpm. Let’s test the claim that NSCC students differ from that national average.
\(H_0: \mu = 72\)
\(H_A: \mu \neq 72\)
Tasks:
Use your confidence interval – Does it contain 72?
Based on that – Do you reject or fail to reject the null hypothesis?
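Both tasks can be checked directly in R. The sketch below rebuilds the 95% interval from the summary values reported earlier (mean 73.47368, 38 non-NA values, σ = 14) and tests whether 72 lies inside it. The variable names `mu0`, `contains_mu0`, and `decision` are illustrative choices.

```r
# Summary values from the earlier output
mn <- 73.47368   # sample mean pulse rate
ss <- 38         # number of non-NA pulse rates
sigma <- 14      # population sd, assumed known
mu0 <- 72        # hypothesized national average

# 95% confidence interval bounds
margin <- 1.96 * sigma / sqrt(ss)
lower <- mn - margin
upper <- mn + margin

# Does the interval contain the hypothesized mean?
contains_mu0 <- lower <= mu0 & mu0 <= upper
contains_mu0

# A two-sided test at alpha = 0.05 rejects H0 only if 72 falls outside the interval
decision <- if (contains_mu0) "fail to reject H0" else "reject H0"
decision
```

Since the 95% interval (69.02 to 77.93) contains 72, the check returns `TRUE` and the decision is to fail to reject the null hypothesis.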
Questions:
What does your result suggest about NSCC students? There isn’t sufficient evidence to say that NSCC students differ from 72 bpm.
Does “fail to reject” mean NSCC students are the same as average? No. Failing to reject the null hypothesis only means there isn’t enough evidence to say NSCC students are different; it does not prove that their mean pulse rate equals 72 bpm.
Task: Recall the sample data you got in question 1. For the hypotheses in question 3, compute the test statistic of that sample data and the p-value using pnorm().
# Calculate and store the z test statistic comparing the sample mean to 72 bpm
z <- (mn - 72) / (14 / sqrt(ss))
z
## [1] 0.6488857
# Calculate the p-value
2*pnorm(abs(z), lower.tail = FALSE)
## [1] 0.5164123
Questions:
What does your p-value represent in context? The p-value is the probability of getting a sample mean at least as far from 72 bpm as mine, in either direction, if the true population mean were actually 72 bpm.
Using an α = 0.05, is your conclusion the same as with the 95% confidence interval? If not, why might they differ? Yes, my conclusion is the same as with the 95% confidence interval. The p-value of 0.516 is greater than 0.05, so I fail to reject the null hypothesis: there isn’t enough evidence to say that NSCC students differ from the national average.
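The meaning of the p-value can also be checked by simulation. The sketch below assumes the null hypothesis is true and that sample means are normally distributed with standard error 14/√38, so it draws the sample means directly from that distribution. The seed and the number of repetitions are arbitrary, and the names `null_means` and `sim_p` are illustrative.

```r
set.seed(42)
mu0 <- 72                   # hypothesized mean under H0
sigma <- 14                 # population sd, assumed known
ss <- 38                    # sample size
observed_mean <- 73.47368   # sample mean reported earlier
reps <- 100000              # number of simulated sample means (arbitrary)

# Under H0, sample means follow a normal distribution centered at mu0
null_means <- rnorm(reps, mean = mu0, sd = sigma / sqrt(ss))

# Share of simulated means at least as far from 72 as the observed mean
sim_p <- mean(abs(null_means - mu0) >= abs(observed_mean - mu0))
sim_p

# Analytic two-sided p-value for comparison, and the decision at alpha = 0.05
z <- (observed_mean - mu0) / (sigma / sqrt(ss))
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
p_value < 0.05   # TRUE would mean "reject H0"
```

The simulated proportion should land close to the analytic p-value, and since that p-value is well above 0.05, the comparison returns `FALSE`, matching the confidence interval conclusion.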
If you repeated this study of collecting NSCC students’ pulse rates to determine if they differ from the national average:
Could your conclusions change? Why? Yes, my conclusions could change. A different sample would produce a different sample mean, which would lead to a different confidence interval and p-value.
What assumptions did you rely on in using these methods?
The assumptions I relied on in using these methods are that the population standard deviation is known (σ = 14) and that the sampling distribution of the sample mean is approximately normal.
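The normality assumption concerns the sample mean, not the individual pulse rates. The sketch below illustrates why a sample of 38 can make it reasonable even when the data are skewed. Everything in it is hypothetical: the right-skewed population (a shifted gamma distribution with mean 72), the seed, and the helper names `draw_sample` and `skewness` are chosen only for illustration and say nothing about the actual NSCC data.

```r
set.seed(7)
n <- 38         # sample size used in this project
reps <- 10000   # number of simulated samples (arbitrary)

# Hypothetical right-skewed population: mean = 50 + 2 * 11 = 72
draw_sample <- function(n) 50 + rgamma(n, shape = 2, scale = 11)

# Means of many samples of size n
sample_means <- replicate(reps, mean(draw_sample(n)))

# Sample skewness (third standardized moment); near 0 for a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(draw_sample(reps))   # skewness of individual values
skewness(sample_means)        # skewness of means of n = 38 values
```

The sample means are much closer to symmetric than the individual values, which is the central limit theorem at work.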
Are there any limitations, flaws, questions, or concerns you have with the analysis done in this project? The main limitations are that the dataset contains missing values and that the sample may not truly represent all NSCC students.