In this project, students will demonstrate their understanding of the normal distribution, sampling distributions, confidence intervals and hypothesis tests to determine if NSCC students differ from typical averages or if any differences are just due to random variation.
Tasks:
Load and store the sample NSCC Student Dataset using the read.csv() function.
# Store the NSCC student dataset in environment
getwd()
## [1] "/Users/jadaperez/Desktop/hon stats"
nscc_students <- read.csv("/Users/jadaperez/Desktop/hon stats/nscc_student_data.csv")
Find the sample mean and sample size of the PulseRate variable in this dataset and answer the question that follows below.
# Mean pulse rate of this sample
mean(nscc_students$PulseRate, na.rm = TRUE)
## [1] 73.47368
# Find the sample size of pulse rates (hint: it's the number of non-NA values)
table(is.na(nscc_students$PulseRate))
##
## FALSE  TRUE
##    38     2
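A more direct way to get the sample size is to sum the logical vector returned by `!is.na()`. The sketch below uses a small made-up vector, `pulse_example`, as a stand-in for `nscc_students$PulseRate`, so its values are illustrative only and are not taken from the dataset.

```r
# Hypothetical stand-in for nscc_students$PulseRate (values are made up)
pulse_example <- c(72, 80, NA, 65, 90, NA, 70)

# TRUE counts as 1 when summed, so this is the number of non-NA values
n_example <- sum(!is.na(pulse_example))
n_example
```

Applied to the real column, the same expression would return the count that `table()` reports under FALSE.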
Questions:
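Do you expect the sample mean to equal the true population mean? Why? No. The sample mean is an estimate based on only 38 students, so random sampling variation (and any outliers that happen to be in the sample) will almost always leave it somewhat above or below the true population mean.
If we took a different sample, would we get the same results? No, a different sample would contain different students with different pulse rates, so its mean would almost certainly differ from 73.47.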
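Sampling variation can be illustrated with a small simulation. This is only a sketch: it assumes a hypothetical normal population with mean 72 and standard deviation 14 (the real distribution of NSCC pulse rates is unknown) and draws repeated samples of size 38. The seed and variable names are arbitrary.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Five samples of size 38 from an ASSUMED N(72, 14) population;
# each sample gives a different mean
sample_means <- replicate(5, mean(rnorm(38, mean = 72, sd = 14)))
round(sample_means, 2)

# Over many samples, the spread of the sample means is close to sigma / sqrt(n)
many_means <- replicate(10000, mean(rnorm(38, mean = 72, sd = 14)))
sd(many_means)
14 / sqrt(38)
```

The last two values should be close to each other, which is why the standard error 14/sqrt(38) is the natural yardstick for how far a sample mean tends to fall from the population mean.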
Task: Construct 90%, 95%, and 99% Confidence Intervals for the mean pulse rate of all NSCC students. Assume that σ = 14.
# Store mean
mean_pulse <- mean(nscc_students$PulseRate, na.rm=TRUE)
# Calculate lower bound of 95% CI
mean_pulse - 1.96*(14/sqrt(38))
## [1] 69.02233
# Calculate upper bound of 95% CI
mean_pulse + 1.96*(14/sqrt(38))
## [1] 77.92504
# Calculate lower bound of 99% CI
mean_pulse - 2.58*(14/sqrt(38))
## [1] 67.61425
# Calculate upper bound of 99% CI
mean_pulse + 2.58*(14/sqrt(38))
## [1] 79.33312
# Calculate lower bound of 90% CI
mean_pulse - 1.645*(14/sqrt(38))
## [1] 69.73772
# Calculate upper bound of 90% CI
mean_pulse + 1.645*(14/sqrt(38))
## [1] 77.20964
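All three intervals follow the same formula, sample mean ± z* × σ/√n, so they can also be computed in one step. The following is a minimal sketch, assuming the sample mean (73.47368) and sample size (38) reported earlier and σ = 14. `z_interval` is a hypothetical helper name, and `qnorm()` supplies the critical values in place of the rounded 1.645, 1.96, and 2.58, so the bounds may differ slightly in the last decimals.

```r
# Hypothetical helper: z-interval for a mean when sigma is assumed known
z_interval <- function(xbar, sigma, n, conf_level) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  margin <- z_star * sigma / sqrt(n)         # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

xbar  <- 73.47368  # sample mean reported above
sigma <- 14        # population SD given in the task
n     <- 38        # number of non-NA pulse rates

# One row per confidence level
ci_table <- t(sapply(c(0.90, 0.95, 0.99),
                     function(cl) z_interval(xbar, sigma, n, cl)))
rownames(ci_table) <- c("90%", "95%", "99%")
round(ci_table, 2)
```

Reading down the rows shows the trade-off directly: a higher confidence level uses a larger critical value, which produces a wider interval.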
Questions:
I am 95% confident that the population mean pulse rate is between 69.02 and 77.93 bpm.
The interval would widen.
I’d report the 95% interval because it is the conventional choice. Since I don’t know the true population mean, it is a reasonable compromise between the narrower 90% interval and the wider 99% interval.
Consider the national average pulse rate for US adults to be 72 bpm. Let’s test the claim that NSCC students differ from that national average.
\(H_0: \mu = 72\)
\(H_A: \mu \neq 72\)
Tasks:
Yes, 72 falls within the 95% confidence interval (69.02 to 77.93).
I fail to reject the null hypothesis.
Questions:
There is not sufficient evidence to say that the mean pulse rate of NSCC students differs from the national average for US adults.
No, it means we don’t have enough evidence to say they are different, not that they’re the same.
Task: Recall the sample data you got in question 1. For the hypotheses in question 3, compute the test statistic of that sample data and the p-value using pnorm().
pnorm(73.47, 72, 14/sqrt(38), lower.tail=FALSE)
## [1] 0.2587307
2*0.2587307
## [1] 0.5174614
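The task also asks for the test statistic. A minimal sketch, assuming the same rounded sample mean (73.47), n = 38, and σ = 14 used above: standardize the sample mean to a z-score, then take both tails of the standard normal distribution. The variable names are illustrative.

```r
xbar  <- 73.47  # rounded sample mean, as used in the pnorm() call above
mu_0  <- 72     # mean under the null hypothesis
sigma <- 14     # population SD given in the task
n     <- 38     # number of non-NA pulse rates

# Test statistic: how many standard errors the sample mean lies from mu_0
z_stat <- (xbar - mu_0) / (sigma / sqrt(n))

# Two-sided p-value: area in both tails beyond |z|
p_value <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)

z_stat
p_value
```

This should give a z-score of roughly 0.65 and reproduce the two-sided p-value of about 0.517 computed above.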
Questions:
The p-value is the probability of getting a sample mean at least as far from 72 as mine, in either direction, if the true population mean really is 72. In this case, there’s about a 51.7% probability of getting a sample mean 1.47 bpm or more away from 72 bpm when the true population mean is 72 bpm.
Yes, this agrees with the conclusion from the 95% confidence interval. Since my p-value (0.517) is larger than the 0.05 significance level, I again fail to reject the null hypothesis.
If you repeated this study of collecting NSCC students’ pulse rates to determine if they differ from the national average:
Yes, my conclusions could change. A different random sample would contain different values, which would give a different sample mean and p-value.
I assumed the population distribution is approximately normal.
There could be many concerns. For example, the sample may not include fully online NSCC students, in which case the data would not be representative of the entire student population.
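To see how repeating the study could change the conclusion, the whole procedure can be simulated many times. This is a sketch under stated assumptions: pulse rates are drawn from a hypothetical normal population whose true mean really is 72 with σ = 14, each simulated study has n = 38, and `one_study_p` is an illustrative helper name.

```r
set.seed(42)  # arbitrary seed so the run is repeatable

n     <- 38
sigma <- 14
mu_0  <- 72

# Hypothetical helper: run one simulated study and return its two-sided p-value
one_study_p <- function(true_mean) {
  x <- rnorm(n, mean = true_mean, sd = sigma)  # ASSUMED normal population
  z <- (mean(x) - mu_0) / (sigma / sqrt(n))
  2 * pnorm(abs(z), lower.tail = FALSE)
}

# Repeat the study 10,000 times with the null hypothesis actually true
p_values <- replicate(10000, one_study_p(true_mean = 72))

# Share of studies that would (wrongly) reject at the 0.05 level
reject_rate <- mean(p_values < 0.05)
reject_rate
```

Even with no real difference, roughly 5% of the simulated studies reject the null hypothesis, so a single sample can lead to a different conclusion purely by chance.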