Project #4 - Introduction to Statistical Inference

Purpose

In this project, students will demonstrate their understanding of the normal distribution, sampling distributions, confidence intervals, and hypothesis tests.

Question 1

Assume IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. If a person is randomly selected, find each of the requested probabilities using the R chunks below. Here, x denotes the IQ of the randomly selected person.

P(x > 120)

# Probability of a person having an IQ greater than 120
prob_greater_than_120 <- pnorm(120, mean = 100, sd = 15, lower.tail = FALSE)
prob_greater_than_120 # printing the probability of finding an IQ greater than 120.

## [1] 0.09121122

P(x < 120)

# Probability of a person having an IQ less than 120
prob_less_than_120 <- pnorm(120, mean = 100, sd = 15)
prob_less_than_120 # printout of the probability of finding someone with an IQ under 120.

## [1] 0.9087888
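
The two probabilities above are complements, and both can be recovered from a z-score. The following is a minimal sketch of that relationship; the variable names are illustrative and not part of the assignment.

```r
# Standardize the cutoff: z = (x - mean) / sd
z_120 <- (120 - 100) / 15

# Standard normal tail areas; these match pnorm() called with mean and sd directly
upper_tail <- pnorm(z_120, lower.tail = FALSE)  # P(x > 120)
lower_tail <- pnorm(z_120)                      # P(x < 120)

# For a continuous distribution, the two areas sum to 1
upper_tail + lower_tail
```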

Question 2

What is the probability that a randomly selected student will have an IQ between 80 and 120? (We haven’t explicitly learned this. Think about it though.)

# Probability of an IQ between 80 and 120
prob_between <- pnorm(120, mean = 100, sd = 15) - pnorm(80, mean = 100, sd = 15)
prob_between # printout of the probability that an IQ falls between 80 and 120.

## [1] 0.8175776
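
Because the normal curve is symmetric about the mean, and 80 and 120 are both 20 points from 100, the same area can be found from a single tail. A sketch of that cross-check, with illustrative variable names:

```r
# Area in the tail beyond 120 (equal, by symmetry, to the area below 80)
one_tail <- pnorm(120, mean = 100, sd = 15, lower.tail = FALSE)

# Remove both tails from the total area of 1
prob_between_symmetry <- 1 - 2 * one_tail

# Direct subtraction of cumulative probabilities
prob_between_direct <- pnorm(120, mean = 100, sd = 15) - pnorm(80, mean = 100, sd = 15)

# TRUE if the two approaches agree up to floating-point tolerance
all.equal(prob_between_symmetry, prob_between_direct)
```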

Suppose that a random sample of 12 students is selected. What is the probability that their mean IQ is greater than 120?

# Finding the probability that a sample mean IQ of 12 students is greater than 120
prob_mean_greater_than_120 <- pnorm(120, mean = 100, sd = 15 / sqrt(12), lower.tail = FALSE)
prob_mean_greater_than_120 # printout of the probability that the mean IQ of a sample of 12 students is greater than 120.

## [1] 1.929808e-06
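
The probability is so small because the mean of 12 students varies much less than a single IQ score: its standard error is the population standard deviation divided by the square root of n. A sketch that makes the standard error and z-score explicit (the names `se_mean` and `z_mean` are illustrative):

```r
n <- 12
se_mean <- 15 / sqrt(n)          # standard error of the sample mean
z_mean <- (120 - 100) / se_mean  # number of standard errors 120 lies above 100

# Upper-tail area of the standard normal at that z-score
prob_from_z <- pnorm(z_mean, lower.tail = FALSE)
prob_from_z
```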

Question 3

Load and store the sample NSCC Student Dataset using the read.csv() function. Find the mean and sample size of the PulseRate variable in this dataset and answer the question that follows below.

# Store the NSCC student dataset in environment
nscc_student_data <- read.csv('nscc_student_data.csv')

# Find the mean pulse rate of this sample
mean_pulse <- mean(nscc_student_data$PulseRate, na.rm = TRUE)

# Find the sample size of pulse rates (hint: it's the number of non-NA values)
# Sample size (excluding NAs)
sample_size <- sum(!is.na(nscc_student_data$PulseRate))
sample_size # Excluding NA values

## [1] 38

sample_sizeNA <- sum(is.na(nscc_student_data$PulseRate))
sample_sizeNA # Count of NA values only; not part of the sample size above, but provided for context.

## [1] 2

Do you think it is likely or unlikely that the population mean pulse rate for all NSCC students is exactly equal to that sample mean found?
I think it is unlikely that the population mean pulse rate for all NSCC students is exactly equal to that sample mean found.
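
One way to see why an exact match is unlikely is that sample means vary from sample to sample. The simulation below is only an illustration: the population values (mean 72, standard deviation 14), the number of repetitions, and the seed are assumptions, not facts about NSCC students.

```r
set.seed(1)  # arbitrary seed so the run is reproducible

# Hypothetical population: normal with mean 72 and sd 14 (assumed for illustration)
pop_mean <- 72
pop_sd <- 14

# Draw 1,000 samples of 38 students each and record every sample mean
sample_means <- replicate(1000, mean(rnorm(38, mean = pop_mean, sd = pop_sd)))

# Sample means cluster around the population mean without matching it exactly
range(sample_means)
sum(sample_means == pop_mean)  # count of sample means exactly equal to the population mean
```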

Question 4

If we assume the population standard deviation of pulse rates for all NSCC students is \(\sigma = 14\), construct a 95% confidence interval for the mean pulse rate of all NSCC students and conclude in a complete sentence below.
(Note: we can create a valid confidence interval here since n > 30)

# Store mean
mean_pulse <- mean(nscc_student_data$PulseRate, na.rm = TRUE)
mean_pulse

## [1] 73.47368

# 95% CI calculation (t critical value with n - 1 degrees of freedom)
t_crit <- qt(0.975, sample_size - 1)
margin_error <- t_crit * (14 / sqrt(sample_size))
t_crit # t Critical

## [1] 2.026192

margin_error # margin of error

## [1] 4.601685

# Calculate lower bound of 95% CI
ci_lower <- mean_pulse - margin_error
ci_lower # lower bound of confidence interval

## [1] 68.872

# Calculate upper bound of 95% CI
ci_upper <- mean_pulse + margin_error
ci_upper # upper bound of confidence interval

## [1] 78.07537

I am 95% confident that the true mean pulse rate of all NSCC students lies between 68.9 and 78.1 beats per minute.
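
Because \(\sigma = 14\) is treated as known here, many textbooks would use a z critical value from qnorm() rather than a t critical value from qt(). The following is a sketch of that alternative, using the sample mean and sample size printed above; the variable names are illustrative.

```r
x_bar <- 73.47368  # sample mean pulse rate reported above
n <- 38            # number of non-missing pulse rates
sigma <- 14        # assumed population standard deviation

z_crit <- qnorm(0.975)                      # two-sided 95% critical value
margin_error_z <- z_crit * sigma / sqrt(n)  # margin of error using z instead of t

c(lower = x_bar - margin_error_z, upper = x_bar + margin_error_z)
```

The z-based interval is slightly narrower than the t-based one because qnorm(0.975) is smaller than qt(0.975, 37).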

Question 5

Construct a 99% confidence interval for the mean pulse rate of all NSCC students and conclude your result in a complete sentence below the R chunk.

# Calculate lower bound of 99% CI and calculate upper bound of 99% CI
t_crit_99 <- qt(0.995, sample_size - 1) # store the critical value for 99%
margin_error_99 <- t_crit_99 * (14 / sqrt(sample_size)) # determine the MOE
ci_lower_99 <- mean_pulse - margin_error_99 # find the lower bound
ci_upper_99 <- mean_pulse + margin_error_99 # find the upper bound

t_crit_99 # printout the tcrit

## [1] 2.715409

margin_error_99 # print the margin of error

## [1] 6.166964

ci_lower_99 # printout the confidence interval lower bound

## [1] 67.30672

ci_upper_99 # printout the confidence interval upper bound

## [1] 79.64065

I am 99% confident that the true mean pulse rate of all NSCC students lies between the calculated values of 67.3 and 79.6 beats per minute.

Question 6

Describe and explain the difference you observe in your confidence interval results for questions 5 and 6.

I am assuming this was a typo and that we are meant to address the differences between the confidence interval results for questions 4 and 5 in this R Markdown document.

If that is the case, the 99% confidence interval is wider than the 95% confidence interval. A higher confidence level requires a larger critical value (2.715 versus 2.026 here), which increases the margin of error. The basic idea is that in order to be more confident that the interval captures the true population mean, we must widen the interval.
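
The widening can be seen directly from the critical values. The sketch below uses the degrees of freedom from this sample (38 - 1 = 37) and the assumed \(\sigma = 14\); the 90% level is added only for comparison and was not part of the assignment.

```r
df <- 37             # sample size of 38 minus 1
se <- 14 / sqrt(38)  # standard error with sigma = 14

conf_levels <- c(0.90, 0.95, 0.99)
t_crits <- qt(1 - (1 - conf_levels) / 2, df)  # two-sided critical values

# Interval width is twice the margin of error, so it grows with the critical value
data.frame(conf_level = conf_levels, t_crit = t_crits, width = 2 * t_crits * se)
```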

Question 7

In the Fall 2019 semester of the 2019-20 academic year, the average NSCC student took 12.3 credits with \(\sigma = 3.4\). I’m curious if that average differs among NSCC students last year (a sample of which is in the NSCC student dataset). Conduct a hypothesis test by a confidence interval to determine whether last year's average number of credits differs from the Fall 2019 average.

Write the hypotheses (Try to emulate the LaTeX format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0 : \mu = 12.3\)

\(H_a : \mu \neq 12.3\)

Create confidence interval

# Calculate mean of Credits variable
mean_credits <- mean(nscc_student_data$Credits, na.rm = TRUE) # calculate the mean from the specified dataset
n_credits <- sum(!is.na(nscc_student_data$Credits)) # Calculate sample size of Credits variable
t_crit_credits <- qt(0.975, n_credits - 1) # find the critical value
margin_error_credits <- t_crit_credits * (3.4 / sqrt(n_credits)) # find the MOE
ci_lower_credits <- mean_credits - margin_error_credits # find the lower bound for conf int
ci_upper_credits <- mean_credits + margin_error_credits # find the upper bound for conf int

# Print all calculated answers
mean_credits

## [1] 11.775

n_credits

## [1] 40

t_crit_credits

## [1] 2.022691

margin_error_credits

## [1] 1.087373

ci_lower_credits # Lower bound of 95% CI

## [1] 10.68763

ci_upper_credits # Upper bound of 95% CI

## [1] 12.86237

Make decision to reject H0 or fail to reject H0 based on confidence interval
The confidence interval (10.69, 12.86) includes 12.3, the mean stated in the null hypothesis, so we fail to reject \(H_0 : \mu = 12.3\). The difference between last year's mean number of credits and the Fall 2019 average is not statistically significant at the 0.05 level.
Write a concluding statement
Based on the 95% confidence interval calculated from last year’s NSCC student sample, we do not have enough evidence to conclude that the average number of credits taken differs from the Fall 2019 average of 12.3, because 12.3 lies inside the interval.
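
The decision rule used above can be written as a single comparison. This is a sketch using the interval bounds printed earlier; `mu_0` and `reject_h0` are illustrative names.

```r
mu_0 <- 12.3          # hypothesized mean from Fall 2019
ci_lower <- 10.68763  # lower bound of the 95% CI printed above
ci_upper <- 12.86237  # upper bound of the 95% CI printed above

# Reject H0 only if the hypothesized mean falls outside the interval
reject_h0 <- mu_0 < ci_lower || mu_0 > ci_upper
reject_h0
```

Here `reject_h0` is FALSE, which matches the decision to fail to reject the null hypothesis.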

Question 8

NSCC is investigating whether NSCC students have a higher than average stress level which can be identified by a higher than average standing pulse rate. Conduct a hypothesis test by the p-value method to determine if NSCC students have a higher pulse rate than the national average of 72 bpm for adults. Recall the assumption that \(\sigma = 14\) for NSCC student pulse rates.

Write the hypotheses (You may try to emulate the LaTeX format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0 : \mu = 72\)

\(H_a : \mu > 72\)

Calculate p-value of getting sample data by chance

# Calculate mean of PulseRate variable
mean_pulse <- mean(nscc_student_data$PulseRate, na.rm = TRUE)
pulse_sample_size <- sum(!is.na(nscc_student_data$PulseRate))

# Calculate the z-score
pulse_z_score <- (mean_pulse - 72) / (14 / sqrt(pulse_sample_size))
pulse_z_score

## [1] 0.6488857

# Probability of getting that sample data by random chance if pop mean was indeed 72 bpm
p_value <- pnorm(pulse_z_score, lower.tail = FALSE)

# printout the p-value
p_value

## [1] 0.2582061

Make decision to reject H0 or fail to reject H0 at a significance level of 0.05 based on p-value.
The p-value (0.258) is well above the significance level of 0.05, so we fail to reject the null hypothesis, \(H_0 : \mu = 72\).
Write a concluding statement
At a significance level of 0.05, we have insufficient evidence to conclude that NSCC students have a higher mean pulse rate than the national average of 72 bpm. We fail to reject the null hypothesis: the data do not support the claim of a higher-than-average pulse rate among NSCC students, despite their obvious signs of stress.
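
The p-value steps above can be collected into a small helper. This is a sketch, not a built-in R function: `z_test_upper` is a hypothetical name, and the inputs are the summary values printed earlier in this question.

```r
# Upper-tailed one-sample z-test from summary statistics (hypothetical helper)
z_test_upper <- function(x_bar, mu_0, sigma, n) {
  z <- (x_bar - mu_0) / (sigma / sqrt(n))  # test statistic
  p <- pnorm(z, lower.tail = FALSE)        # area to the right of z
  list(z = z, p_value = p)
}

result <- z_test_upper(x_bar = 73.47368, mu_0 = 72, sigma = 14, n = 38)
result$z
result$p_value
result$p_value > 0.05  # TRUE means we fail to reject H0 at the 0.05 level
```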