Project #4 - Introduction to Statistical Inference

Purpose – Are North Shore Students Different?

In this project, students will demonstrate their understanding of the normal distribution, sampling distributions, confidence intervals and hypothesis tests to determine if NSCC students differ from typical averages or if any differences are just due to random variation.

Question 1: Sample v. Population

Tasks:

Load and store the sample NSCC Student Dataset using the read.csv() function.

# Store the NSCC student dataset in environment
getwd()

## [1] "C:/Users/ash91/OneDrive/Desktop/Honors_Stats"

nscc_student <- read.csv("C:/Users/ash91/OneDrive/Desktop/Honors_Stats/nscc_student_data.csv")

Find the sample mean and sample size of the PulseRate variable in this dataset and answer the question that follows below.

# Mean pulse rate of this sample

mean(nscc_student$PulseRate, na.rm = TRUE)

## [1] 73.47368

# Find the sample size of pulse rates (hint: it's the number of non-NA values)

table(is.na(nscc_student$PulseRate))

## 
## FALSE  TRUE 
##    38     2

The mean pulse rate of the sample is approximately 73.5. The sample size is 38.
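
The sample size can also be counted directly instead of being read off a table. This is a minimal sketch using a small made-up vector (`pulse` is hypothetical, not the NSCC data) so that it runs on its own:

```r
# Hypothetical pulse rates with two missing values (not the NSCC data)
pulse <- c(68, 72, NA, 80, 75, NA, 64)

# is.na() flags missing entries; negating and summing counts the rest
n_obs <- sum(!is.na(pulse))
n_obs

# Mean over the same non-missing values
pulse_mean <- mean(pulse, na.rm = TRUE)
pulse_mean
```

Applied to the dataset, `sum(!is.na(nscc_student$PulseRate))` should agree with the count in the FALSE column of the table above.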

Questions:

Do you expect the sample mean to equal the true population mean? Why?

No, I do not expect the sample mean to equal the true population mean. The sample mean is an estimate computed from a subset of the population, so random variation in which students happen to be sampled will usually cause it to differ somewhat from the true mean. Additionally, the sample is fairly small (38 students), so the estimate may not be very precise.

If we took a different sample, would we get the same results?

No, a different sample would most likely produce a different sample mean, because it would include different students with different pulse rates. How much the means differ depends on the sample size and on how much pulse rates vary from student to student.
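
Sampling variability can be illustrated with a short simulation. The sketch below assumes a normal population with mean 72 and standard deviation 14; these are illustrative values, not a claim about the real NSCC population, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Five samples of 38 from the assumed population; each gives a different mean
sample_means <- replicate(5, mean(rnorm(n = 38, mean = 72, sd = 14)))
sample_means

# Spread of many sample means; theory suggests roughly 14 / sqrt(38)
many_means <- replicate(10000, mean(rnorm(n = 38, mean = 72, sd = 14)))
sd(many_means)
14 / sqrt(38)
```

The last two values should be close to each other, which is why the standard error 14/sqrt(38) is a natural measure of how much a sample mean moves from sample to sample.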

Question 2: Confidence Intervals

Task: Construct 90%, 95%, and 99% Confidence Intervals for the mean pulse rate of all NSCC students. Assume that σ = 14.

# Store mean

mn <- mean(nscc_student$PulseRate, na.rm = TRUE)

# Calculate lower bound of 90% CI

mn - 1.645*(14/sqrt(38))

## [1] 69.73772

# Calculate upper bound of 90% CI

mn + 1.645*(14/sqrt(38))

## [1] 77.20964

# Calculate lower bound of 95% CI

mn - 1.96*(14/sqrt(38))

## [1] 69.02233

# Calculate upper bound of 95% CI

mn + 1.96*(14/sqrt(38))

## [1] 77.92504

# Calculate lower bound of 99% CI

mn - 2.58*(14/sqrt(38))

## [1] 67.61425

# Calculate upper bound of 99% CI

mn + 2.58*(14/sqrt(38))

## [1] 79.33312
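
The six calculations above repeat one formula, so they can be wrapped in a small function. This is only a sketch: `z_ci()` is a hypothetical helper (not part of base R), and the mean, σ, and n are the values reported in this document, entered as constants so the block runs without the data file.

```r
# Summary values reported earlier in this document
mn    <- 73.47368  # sample mean pulse rate
sigma <- 14        # population sd given in the task
n     <- 38        # number of non-missing pulse rates

# z_ci() is a hypothetical helper for a z-interval with known sigma
z_ci <- function(xbar, sigma, n, level) {
  z  <- qnorm(1 - (1 - level) / 2)   # two-sided critical value
  me <- z * sigma / sqrt(n)          # margin of error
  c(lower = xbar - me, upper = xbar + me)
}

# One row per confidence level: 90%, 95%, 99%
t(sapply(c(0.90, 0.95, 0.99), function(lv) z_ci(mn, sigma, n, lv)))
```

Because `qnorm()` returns unrounded critical values (about 2.576 rather than 2.58 for 99%), the bounds may differ from the hand calculations by about 0.01.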

Questions:

Interpret your 95% CI in plain language

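We can be 95% confident that the true mean pulse rate of all NSCC students is between approximately 69.0 and 77.9 bpm. In other words, if we repeated this sampling many times and built an interval each time, about 95% of those intervals would contain the true mean.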
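
One way to see what the 95% refers to is to simulate many samples from a population whose mean is known and count how often the interval captures it. The sketch below assumes a normal population with mean 72 and standard deviation 14 (illustrative values only, not a claim about NSCC students).

```r
set.seed(42)  # arbitrary seed for reproducibility

true_mean <- 72   # assumed population mean (illustration only)
sigma     <- 14   # population sd, as given in the task
n         <- 38   # sample size used in this project

# For each simulated sample, build a 95% z-interval and record
# whether it contains the true mean
covered <- replicate(10000, {
  x  <- rnorm(n, mean = true_mean, sd = sigma)
  me <- 1.96 * sigma / sqrt(n)
  (mean(x) - me <= true_mean) && (true_mean <= mean(x) + me)
})

# Proportion of intervals that captured the true mean; expected near 0.95
coverage <- mean(covered)
coverage
```

The 95% describes the long-run success rate of the method across samples, not a probability attached to any single interval.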

How does the interval change as confidence increases?

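The interval widens as the confidence level increases: the 90% interval is about 7.5 bpm wide, the 95% interval about 8.9 bpm, and the 99% interval about 11.7 bpm.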
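
The relationship between confidence level and interval width can be tabulated directly. This sketch uses the σ = 14 and n = 38 from this project; the variable names are illustrative.

```r
se <- 14 / sqrt(38)                     # standard error from the task's sigma and n
conf_levels <- c(0.90, 0.95, 0.99)

# Critical values from qnorm() rather than rounded table values
z <- qnorm(1 - (1 - conf_levels) / 2)

# Full interval width is twice the margin of error
width <- 2 * z * se
data.frame(level = conf_levels, z = round(z, 3), width = round(width, 2))
```

The standard error is the same in every row, so the width grows only because the critical value grows with the confidence level.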

Which interval would you report and why?

I would report the 95% confidence interval. It is the conventional choice in most fields, and it offers high confidence while keeping the interval reasonably narrow. Raising the confidence level increases the margin of error (the 99% interval is about 11.7 bpm wide, versus about 8.9 bpm for the 95% interval), so the 95% level is a good balance between confidence and precision.

Question 3: Hypothesis Testing with a Confidence Interval

Consider the national average pulse rate for US adults to be 72 bpm. Let’s test the claim that NSCC students differ from that national average.

\(H_0: \mu = 72\)
\(H_A: \mu \neq 72\)

Tasks:

Use your confidence interval – Does it contain 72?

Yes, because 72 falls between 69.0 and 77.9, which is the range of plausible values of the 95% confidence interval.

Based on that – Do you reject or fail to reject the null hypothesis?

Based on that, I fail to reject the null hypothesis.
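
The same check can be written as code. This is a small sketch using the 95% bounds computed earlier in this document, entered as constants; the variable names are illustrative.

```r
# 95% interval bounds computed earlier in this document
lower <- 69.02233
upper <- 77.92504
mu0   <- 72          # hypothesized national average (bpm)

# TRUE means 72 is a plausible value, so H0 is not rejected at alpha = 0.05
contains_mu0 <- (lower <= mu0) && (mu0 <= upper)
contains_mu0

decision <- if (contains_mu0) "fail to reject H0" else "reject H0"
decision
```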

Questions:

What does your result suggest about NSCC students?

The results suggest that we are 95% confident that the mean pulse rate of NSCC students falls between 69.0 and 77.9 bpm. Because this range includes 72, it is plausible that NSCC students' mean pulse rate is the same as the national average; there is not sufficient evidence to conclude that it differs.

Does “fail to reject” mean NSCC students are the same as average?

“Fail to reject” does not mean NSCC students are the same as average. Failing to reject a hypothesis does not indicate that the hypothesis is correct. It strictly means that we cannot state that the hypothesis is incorrect or implausible due to insufficient evidence.

Question 4: Hypothesis Testing with a P-value

Task: Recall the sample data you got in question 1. For the hypotheses in question 3, compute the test statistic of that sample data and the p-value using pnorm().

2*pnorm(q = 73.47, mean = 72, sd = 14/sqrt(38), lower.tail = FALSE)

## [1] 0.5174614
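
The task also asks for the test statistic, which the call above folds into `pnorm()` without displaying it. A sketch of computing the z statistic explicitly, using the values reported in this document as constants:

```r
mn    <- 73.47368   # sample mean from Question 1
mu0   <- 72         # null-hypothesis mean
sigma <- 14         # population sd given in the task
n     <- 38         # sample size

# Test statistic: how many standard errors the sample mean lies from mu0
z <- (mn - mu0) / (sigma / sqrt(n))
z

# Two-sided p-value; abs() keeps this correct whichever side of mu0 the mean falls on
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
p_value
```

This version uses the unrounded sample mean, so the p-value may differ in the third decimal place from the one obtained with 73.47.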

Questions:

What does your p-value represent in context?

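The p-value of about 0.517 means that, if the true mean pulse rate of NSCC students were 72 bpm, there would be about a 51.7% chance of obtaining a sample mean at least as far from 72 as the observed 73.47, in either direction. A difference of this size is therefore quite likely to arise from random sampling variation alone. The p-value is not the probability that the null hypothesis is true.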
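
The 0.517 figure can be checked by simulation: generate sample means as they would behave if the null hypothesis were true, then count how often they land at least as far from 72 as the observed mean. This is a sketch under the same assumptions as the test (normal sampling distribution, σ = 14, n = 38); the seed is arbitrary.

```r
set.seed(7)  # arbitrary seed for reproducibility

mu0   <- 72      # mean under the null hypothesis
sigma <- 14      # population sd given in the task
n     <- 38      # sample size
obs   <- 73.47   # observed sample mean, rounded as in the pnorm() call

# Sample means that would arise if H0 were true
null_means <- rnorm(100000, mean = mu0, sd = sigma / sqrt(n))

# Share of those means at least as far from 72 as the observed one
sim_p <- mean(abs(null_means - mu0) >= abs(obs - mu0))
sim_p
```

The simulated proportion should land close to the value returned by `pnorm()`.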

Using an α = 0.05, is your conclusion the same as with the 95% confidence interval? If not, why might they differ?

Yes, my conclusion is the same. My p-value (about 0.517) is larger than 0.05, so the result is not statistically significant and I fail to reject the null hypothesis.

Question 5: Reflection

If you repeated this study of collecting NSCC students’ pulse rates to determine if they differ from the national average:

Could your conclusions change? Why?

Yes, my conclusions could change because a new sample would contain different values than the ones in this study. The sample mean would therefore differ, which would change the confidence intervals, the test statistic, and the p-value.

What assumptions did you rely on in using these methods?

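Using these methods, I relied on the assumption that the sampling distribution of the sample mean is approximately normal, which is reasonable with a sample of 38. I also relied on the population standard deviation being known (σ = 14, as given in the task) and on the sample being a random, independent sample of NSCC students.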

Are there any limitations, flaws, questions, or concerns you have with the analysis done in this project?

One concern I have with the analysis done in this project is that the students' ages are never taken into account when determining whether their pulse rates are comparable to the national average for adults. Pulse rates can differ among age groups, which could skew the results. In this sample the ages are relatively close to each other, but another sample might have a different spread of ages, which would affect the data collected. Additionally, there are NA values, indicating that some student data are missing. However, only 2 of the 40 pulse rate values in this sample are missing.

MAT143H - Introduction to Statistics Honors

Ashley Shepard

Due: Tuesday, April 7
