Project #4 - Introduction to Statistical Inference

Purpose – Are North Shore Students Different?

In this project, students will demonstrate their understanding of the normal distribution, sampling distributions, confidence intervals and hypothesis tests to determine if NSCC students differ from typical averages or if any differences are just due to random variation.

Question 1: Sample v. Population

Tasks:

Load and store the sample NSCC Student Dataset using the read.csv() function.

# Store the NSCC student dataset in environment
nsccstudentdata <- read.csv('C:/Users/sband/Downloads/Honors Stats/nscc_student_data.csv')

Find the sample mean and sample size of the PulseRate variable in this dataset and answer the question that follows below.

# Mean pulse rate of this sample
mean(nsccstudentdata$PulseRate, na.rm = TRUE)

## [1] 73.47368

# Find the sample size of pulse rates (hint: it's the number of non-NA values)
table(is.na(nsccstudentdata$PulseRate))

## 
## FALSE  TRUE 
##    38     2
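
The same count can be obtained as a single number by summing the non-missing flags. This is only a sketch: `pulse_example` is a made-up vector used for illustration, not the NSCC data.

```r
# Hypothetical pulse rates (NOT the NSCC data); NA marks a missing value
pulse_example <- c(72, 80, NA, 65, 90, NA, 77)

# is.na() flags missing entries; negating and summing counts the rest
n_observed <- sum(!is.na(pulse_example))
n_observed

# With na.rm = TRUE the mean is taken over the same non-missing values
mean(pulse_example, na.rm = TRUE)
```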

Questions:

Do you expect the sample mean to equal the true population mean? Why?

No. The sample mean is only an estimate of the true population mean. Because the sample includes just some NSCC students, random sampling variation means it will almost never equal the population mean exactly, and finding the true population mean would require gathering data on every single student.

If we took a different sample, would we get the same results?

Almost certainly not. A different sample would contain different students, so its mean would differ, especially in this case where the sample size is only 38 out of about 5000 students total. With a small sample, each observation has a large influence on the mean, so sample means vary more from one sample to the next.
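
Sampling variation can be illustrated with a small simulation. This is a sketch under stated assumptions: the population below is simulated (5000 values with an assumed mean of 72 and standard deviation of 14), not the real NSCC student body, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 5000 pulse rates (assumed mean 72, sd 14)
population <- rnorm(5000, mean = 72, sd = 14)

# Draw 1000 samples of 38 students each and record every sample mean
sample_means <- replicate(1000, mean(sample(population, size = 38)))

# The sample means differ from draw to draw
range(sample_means)

# Their spread is close to the theoretical standard error sigma / sqrt(n)
sd(sample_means)
14 / sqrt(38)
```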

Question 2: Confidence Intervals

Task: Construct 90%, 95%, and 99% Confidence Intervals for the mean pulse rate of all NSCC students. Assume that σ = 14.

# Store mean
meanpulserate <- mean(nsccstudentdata$PulseRate, na.rm = TRUE)

# Calculate lower bound of 95% CI
meanpulserate - 1.96*(14/sqrt(38))

## [1] 69.02233

# Calculate upper bound of 95% CI
meanpulserate + 1.96*(14/sqrt(38))

## [1] 77.92504

# Calculate lower bound of 90% CI
meanpulserate - 1.645*(14/sqrt(38))

## [1] 69.73772

# Calculate upper bound of 90% CI
meanpulserate + 1.645*(14/sqrt(38))

## [1] 77.20964

# Calculate the lower bound of 99% CI
meanpulserate - 2.58*(14/sqrt(38))

## [1] 67.61425

# Calculate the upper bound of 99% CI
meanpulserate + 2.58*(14/sqrt(38))

## [1] 79.33312
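
The six bounds above can also be produced by one small helper. This is a minimal sketch: `z_interval` is a hypothetical name, not part of the assignment, and it uses `qnorm()` for unrounded critical values, so its results differ slightly in the later decimals from the bounds computed with 1.645, 1.96, and 2.58.

```r
# Two-sided z-interval for a mean when sigma is treated as known
z_interval <- function(xbar, sigma, n, level) {
  z_star <- qnorm(1 - (1 - level) / 2)   # critical value for this level
  margin <- z_star * sigma / sqrt(n)     # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

xbar  <- 73.47368   # sample mean reported above
sigma <- 14         # population sd assumed in the task
n     <- 38         # number of non-missing pulse rates

# One row per confidence level (90%, 95%, 99%)
t(sapply(c(0.90, 0.95, 0.99), function(lv) z_interval(xbar, sigma, n, lv)))
```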

Questions:

Interpret your 95% CI in plain language

We can be 95% confident that the average pulse rate of all NSCC students is between 69.02 and 77.93 BPM.
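
What "95% confident" means can be checked by simulation: if the study were repeated many times, about 95% of the resulting intervals should contain the true mean. The sketch below assumes a hypothetical true mean of 72 and normally distributed pulse rates purely for illustration; neither is known for NSCC students.

```r
set.seed(42)  # arbitrary seed for reproducibility

true_mean <- 72   # hypothetical true mean, chosen only for this simulation
sigma <- 14
n <- 38
margin <- 1.96 * sigma / sqrt(n)

# Repeat the study 10,000 times and check whether each interval covers true_mean
covered <- replicate(10000, {
  xbar <- mean(rnorm(n, mean = true_mean, sd = sigma))
  (xbar - margin <= true_mean) && (true_mean <= xbar + margin)
})

# Proportion of intervals that contain the true mean; expected to be near 0.95
mean(covered)
```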

How does the interval change as confidence increases?

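As confidence increases, the interval gets wider. A higher confidence level uses a larger critical value (1.645, then 1.96, then 2.58), which increases the margin of error, so the interval is more likely to capture the true population mean but gives a less precise estimate.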
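
The widening follows from the margin of error formula. Using the critical values, σ = 14, and n = 38 from the calculations above:

\[ E = z^{*}\,\frac{\sigma}{\sqrt{n}}, \qquad \frac{\sigma}{\sqrt{n}} = \frac{14}{\sqrt{38}} \approx 2.271 \]
\[ E_{90\%} \approx 1.645 \times 2.271 \approx 3.74, \quad E_{95\%} \approx 1.96 \times 2.271 \approx 4.45, \quad E_{99\%} \approx 2.58 \times 2.271 \approx 5.86 \]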

Which interval would you report and why?

I would report the 95% confidence interval in this case because I think it’s the most appropriate for the variable. While knowing the mean pulse rate of NSCC students isn’t a life-or-death situation, the 10% chance of being wrong that comes with a 90% confidence interval is still significant. On the other hand, a 99% confidence interval is wider than this simple statistic calls for.

Question 3: Hypothesis Testing with a Confidence Interval

Consider the national average pulse rate for US adults to be 72 bpm. Let’s test the claim that NSCC students differ from that national average.

\(H_0: \mu = 72\)
\(H_A: \mu \neq 72\)

Tasks:

Use your confidence interval – Does it contain 72?

Yes, the 95% confidence interval contains 72.

Based on that – Do you reject or fail to reject the null hypothesis?

We fail to reject the null hypothesis at the 5% significance level. Because 72 lies inside the 95% confidence interval, 72 BPM is a plausible value for the mean pulse rate of NSCC students, so the data are consistent with the mean pulse rate of all US adults.
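
The link between the interval and the test can be written as a short check. This sketch reuses the 95% bounds and sample mean reported above. The variable names are illustrative only.

```r
# Values taken from the analysis above
mu0   <- 72         # null-hypothesis mean
lower <- 69.02233   # lower bound of the 95% CI
upper <- 77.92504   # upper bound of the 95% CI

# Reject H0 at alpha = 0.05 only if mu0 falls outside the 95% interval
reject_h0 <- (mu0 < lower) || (mu0 > upper)
reject_h0

# Equivalent check with the z statistic: reject only if |z| exceeds 1.96
z <- (73.47368 - mu0) / (14 / sqrt(38))
reject_by_z <- abs(z) > 1.96
reject_by_z
```

Both checks give the same decision because the 95% interval and the two-sided test at α = 0.05 use the same critical value.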

Questions:

What does your result suggest about NSCC students?

It suggests that the pulse rates of NSCC students do not seem to differ that much from the US population, but we cannot be sure.

Does “fail to reject” mean NSCC students are the same as average?

Failing to reject the null hypothesis means that NSCC students could be the same as the average US adult, but they also might not be. This sample does not provide enough evidence of a difference, which is not the same as showing that there is no difference.

Question 4: Hypothesis Testing with a P-value

Task: Recall the sample data you got in question 1. For the hypotheses in question 3, compute the test statistic of that sample data and the p-value using pnorm().

# Upper-tail probability of a sample mean >= 73.47 under H0 (one tail; double it for the two-sided p-value)
pnorm(q = 73.47, mean = 72, sd = 14/sqrt(38), lower.tail = FALSE)

## [1] 0.2587307

Questions:

What does your p-value represent in context?

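The value 0.2587 from pnorm() is an upper-tail probability: assuming the null hypothesis is true (μ = 72), it is the probability that a random sample of 38 students has a mean pulse rate of 73.47 BPM or higher. Because the alternative hypothesis is two-sided, the p-value is twice that, about 0.517, which is the probability of a sample mean at least 1.47 BPM away from 72 in either direction.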

Using an α = 0.05, is your conclusion the same as with the 95% confidence interval? If not, why might they differ?

Yes, we would still fail to reject the null hypothesis. Even the one-tailed probability of 0.2587 is greater than 0.05, and the two-sided p-value (2 × 0.2587 ≈ 0.517) is larger still. A sample mean this far from 72 would be quite common if the null hypothesis were true.
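
The test statistic that the task asks for, together with the p-value for the two-sided alternative, can be sketched as follows. It uses the same rounded sample mean (73.47) as the pnorm() call above. The variable names are illustrative.

```r
xbar  <- 73.47   # sample mean, rounded as in the pnorm() call above
mu0   <- 72      # null-hypothesis mean
sigma <- 14      # population sd assumed in the task
n     <- 38      # sample size

# Test statistic: how many standard errors the sample mean is from mu0
z <- (xbar - mu0) / (sigma / sqrt(n))
z

# Two-sided p-value: probability of a |Z| at least this large under H0
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)
p_two_sided

# Decision at alpha = 0.05 (TRUE would mean reject H0)
p_two_sided < 0.05
```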

Question 5: Reflection

If you repeated this study of collecting NSCC students’ pulse rates to determine if they differ from the national average:

Could your conclusions change? Why?

The conclusions could change, because a different sample would give a different sample mean, but I don’t think that is likely. Extremely low or extremely high pulse rates are uncommon, so a new sample mean would probably again fall close to 72 and we would again fail to reject the null hypothesis. Considering this, I don’t believe the conclusions would be very different if the study were repeated.
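
How often a repeated study would reject the null hypothesis can be estimated by simulation. This is a sketch under stated assumptions: pulse rates are simulated as normal with σ = 14, `rejection_rate` is a hypothetical helper, and the true mean of 75 in the second call is an invented scenario, not a finding.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Proportion of repeated studies (n = 38, sigma = 14) that reject H0: mu = 72
rejection_rate <- function(true_mean, reps = 5000, n = 38, sigma = 14, mu0 = 72) {
  rejected <- replicate(reps, {
    xbar <- mean(rnorm(n, mean = true_mean, sd = sigma))
    z <- (xbar - mu0) / (sigma / sqrt(n))
    abs(z) > 1.96          # two-sided test at alpha = 0.05
  })
  mean(rejected)
}

rate_null  <- rejection_rate(72)   # H0 true: rejections are false alarms, expected near 0.05
rate_shift <- rejection_rate(75)   # hypothetical true mean of 75: rejections are more frequent
c(rate_null, rate_shift)
```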

What assumptions did you rely on in using these methods?

We assume that the sampling distribution of the sample mean is approximately normal, which the central limit theorem supports because the sample size of 38 is reasonably large. We also treat the population standard deviation as known (σ = 14). In addition, we have to assume that the students who were sampled are comparable to the general US population, meaning their pulse rates were taken in the same way as in the studies of US adults. Finally, we assume that there aren’t extreme outliers in the dataset.

Are there any limitations, flaws, questions, or concerns you have with the analysis done in this project?

I think that when pulse rate is the variable, the variation can be very high once you consider age, medical status, stress level, and other factors. Perhaps we could study the pulse rate of NSCC students during a high-stress time such as finals week and compare it to students at another college to see if there are significant differences. I also think the sample size is almost too small for the central limit theorem to apply. Even though 38 technically clears the common rule of thumb of 30 observations, I think a larger sample size would improve the analysis.
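
How much larger a sample would need to be can be estimated by solving the margin of error formula for n. A minimal sketch, assuming σ = 14 as in the task. `required_n` is a hypothetical helper and the 2 BPM target is chosen only as an example.

```r
# Sample size needed for a given margin of error at a given confidence level
required_n <- function(margin, sigma = 14, level = 0.95) {
  z_star <- qnorm(1 - (1 - level) / 2)
  ceiling((z_star * sigma / margin)^2)   # round up to a whole student
}

required_n(margin = 2)   # hypothetical target: estimate the mean within 2 BPM
```

Halving the target margin roughly quadruples the required sample size, because n grows with the square of 1 / margin.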

MAT143H - Introduction to Statistics Honors

Spencer Anderson

Due: Tuesday, April 7
