Project #4 - Sampling Distributions

Purpose

In this project, students will demonstrate their understanding of the normal distribution, sampling distributions, confidence intervals, and hypothesis tests.

Question 1

Assume IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. If a person is randomly selected, find each of the requested probabilities. Here, x denotes the IQ of the randomly selected person.
a. P(x > 140)

# Use the pnorm function to find probabilities in a normal distribution.

pnorm(140, mean = 100, sd = 15, lower.tail = FALSE)

## [1] 0.003830381

The probability of a randomly selected person having an IQ above 140 is 0.0038.

b. P(x < 110)

pnorm(110, mean = 100, sd = 15, lower.tail = TRUE)

## [1] 0.7475075

The probability of a randomly selected person having an IQ less than 110 is 0.7475.

c. What is the probability that a randomly selected student will have an IQ between 80 and 120?

# To calculate this probability, I'll find the probability of an IQ below 120 and subtract the probability of an IQ below 80.

pnorm(120, mean = 100, sd = 15, lower.tail = TRUE) - 
pnorm( 80, mean = 100, sd = 15, lower.tail = TRUE)

## [1] 0.8175776

The probability of a randomly selected person having an IQ between 80 and 120 is 0.8176. This seems reasonable: 80 to 120 is about 1.33 standard deviations on either side of the population mean of 100, so the probability should fall between the roughly 68% of IQs within one standard deviation (85 to 115) and the roughly 95% within two standard deviations (70 to 130).
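As a rough cross-check of the answer above, the endpoints can be standardized and compared with the one- and two-standard-deviation benchmarks. This is only a sketch; the variable names (`z_low`, `z_high`, `p_between`, and so on) are chosen here for illustration.

```r
# Standardize the endpoints 80 and 120, then use the standard normal distribution
mu <- 100
sigma <- 15
z_low  <- (80  - mu) / sigma    # about -1.33
z_high <- (120 - mu) / sigma    # about  1.33
p_between <- pnorm(z_high) - pnorm(z_low)

# Benchmarks: probability within one and within two standard deviations
p_one_sd <- pnorm(1) - pnorm(-1)
p_two_sd <- pnorm(2) - pnorm(-2)

c(one_sd = p_one_sd, between_80_120 = p_between, two_sd = p_two_sd)
```

The middle value should agree with the pnorm difference computed above and sit between the two benchmarks.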

Question 2

Continue to assume IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. If a person is randomly selected, find each of the requested probabilities.
a. What is the probability that a randomly selected student will have an IQ greater than 110?

#This is the complement of question 1b, so we can subtract that answer from 1.

1 - pnorm(110, mean = 100, sd = 15, lower.tail = TRUE)

## [1] 0.2524925

The probability that a randomly selected student has an IQ greater than 110 is 0.2525.
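The complement approach can be checked against the `lower.tail = FALSE` form used in Question 1a. A minimal sketch, with `p_below` and `p_above` as illustrative names:

```r
p_below <- pnorm(110, mean = 100, sd = 15)                      # P(x < 110)
p_above <- pnorm(110, mean = 100, sd = 15, lower.tail = FALSE)  # P(x > 110)

# Both routes to the upper-tail probability should agree
all.equal(1 - p_below, p_above)

# The two tails together cover the whole distribution
p_below + p_above
```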

b. Suppose that a random sample of 12 students is selected. What is the probability that their mean IQ is greater than 110?

To find this probability, use the pnorm function with the same mean, but replace the standard deviation with the standard error of the sample mean: the population standard deviation of 15 divided by the square root of the sample size of 12. \(SE = \sigma / \sqrt{n} = 15/\sqrt{12}\)

pnorm(110, mean = 100, sd = (15/sqrt(12)), lower.tail = FALSE)

## [1] 0.01046067

I can check this by calculating the z-score of the sample mean and finding the probability from the standard normal distribution. \(z^* = \frac{\bar{x} - \mu}{SE}\)

#First, calculate z*
z_star <- (110 - 100)/(15/sqrt(12))
z_star
## [1] 2.309401

#Then, use pnorm to find the probability
pnorm(z_star, mean = 0, sd = 1, lower.tail = FALSE)
## [1] 0.01046067

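The probability that the mean IQ of 12 students is greater than 110 is 0.0105, which is much smaller than the probability of an individual IQ being over 110, 0.2525. Averaging over 12 students shrinks the spread from 15 to about 4.33, so a sample mean that far above 100 is much rarer.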
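To see how the sample size drives this result, the same calculation can be repeated for several values of n. The sketch below is illustrative only; the sample sizes other than 1 and 12 are arbitrary choices, not part of the assignment.

```r
mu <- 100
sigma <- 15
n <- c(1, 4, 12, 30)          # n = 1 corresponds to a single person
se <- sigma / sqrt(n)         # standard error of the mean for each n

# P(sample mean > 110) for each sample size; pnorm is vectorized over sd
p_above_110 <- pnorm(110, mean = mu, sd = se, lower.tail = FALSE)

data.frame(n = n, se = round(se, 3), p_above_110 = round(p_above_110, 4))
```

The rows for n = 1 and n = 12 should reproduce the answers to parts a and b, and the probability should fall steadily as n grows.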

Question 3

Load and store the sample NSCC Student Dataset using the read.csv() function. Find the mean, standard deviation, and sample size of the PulseRate variable in this dataset. Do you think it is likely or unlikely that the population mean pulse rate for all NSCC students is exactly equal to that sample mean found?

#Load the data into the object called nscc_students
nscc_students <- read.csv("C:/Users/Henhoag/Desktop/Math143H/Projects/nscc_student_data.csv")

#First, let's get rid of the NA's. I'll store the subset that excludes observations that are missing pulse rate data in the object called PulseRateStudents. is.na() tests for missing values directly.
PulseRateStudents <- subset(nscc_students, !is.na(PulseRate))

#Find the mean pulse rate.
mean(PulseRateStudents$PulseRate)
## [1] 73.47368

#Calculate the standard deviation of the pulse rates.
sd(PulseRateStudents$PulseRate)
## [1] 12.51105

#The sample size is just the number of rows in the PulseRateStudents object. I'll store it in the variable numberPulseRate.
numberPulseRate <- nrow(PulseRateStudents)
numberPulseRate
## [1] 38

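The mean pulse rate of the 38 students with recorded pulse rates is 73.47 bpm, with a standard deviation of 12.51 bpm. I think it is unlikely that the population mean pulse rate for all NSCC students is exactly equal to this sample mean. Sample means vary from one random sample to the next, so 73.47 bpm is best read as an estimate of the population mean, not its exact value.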
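An alternative to building a separate subset is to ask each summary function to skip missing values. The sketch below uses a small made-up vector, `pulse`, purely for illustration; it is not the NSCC data.

```r
# Hypothetical pulse rates (bpm) with two missing values; NOT the NSCC data
pulse <- c(72, 88, NA, 64, 70, NA, 81)

mean(pulse, na.rm = TRUE)   # mean of the non-missing values
sd(pulse, na.rm = TRUE)     # standard deviation of the non-missing values
sum(!is.na(pulse))          # number of observations actually used
```

Either approach should give the same statistics; the subset version has the advantage that `nrow()` gives the correct sample size directly.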

Question 4

Construct a 95% confidence interval for the mean pulse rate of all NSCC students and conclude your result in a complete sentence. (Note: we can create a valid confidence interval here since n > 30)

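A 95% confidence interval for the mean is calculated with the formula \(\bar{x}\pm 1.96\cdot s/\sqrt{n}\), where the sample standard deviation \(s\) stands in for the unknown population standard deviation \(\sigma\).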

# Store mean and std dev

meanPulseRate <- mean(PulseRateStudents$PulseRate)

sdPulseRate <- sd(PulseRateStudents$PulseRate)

# Calculate lower bound of 95% CI
meanPulseRate - 1.96*(sdPulseRate/sqrt(numberPulseRate))

## [1] 69.49575

# Calculate upper bound of 95% CI
meanPulseRate + 1.96*(sdPulseRate/sqrt(numberPulseRate))

## [1] 77.45162

Based on these data, we are about 95% confident that the mean pulse rate of all NSCC students is between 69.50 bpm and 77.45 bpm.
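The interval calculation can be wrapped in a small helper that looks up the critical value with `qnorm` instead of hard-coding 1.96. This is a sketch under stated assumptions: `z_interval` is a hypothetical helper name, and the inputs are the rounded summary statistics printed in Question 3, so the bounds may differ from those above in the later decimal places.

```r
# Hypothetical helper: z-based confidence interval from summary statistics
z_interval <- function(xbar, s, n, conf = 0.95) {
  z_crit <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  margin <- z_crit * s / sqrt(n)        # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Summary statistics reported in Question 3
ci95 <- z_interval(xbar = 73.47368, s = 12.51105, n = 38)
ci95

# t-based alternative, which accounts for estimating sigma with s
t_crit <- qt(0.975, df = 38 - 1)
73.47368 + c(-1, 1) * t_crit * 12.51105 / sqrt(38)
```

The t-based interval should come out slightly wider, because the t critical value with 37 degrees of freedom is a little larger than the normal one.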

Question 5

Construct a 99% confidence interval for the mean pulse rate of all NSCC students and conclude your result in a complete sentence below the R chunk.

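A 99% confidence interval for the mean is calculated with the formula \(\bar{x}\pm 2.58\cdot s/\sqrt{n}\), where the sample standard deviation \(s\) stands in for the unknown population standard deviation \(\sigma\).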

# Calculate lower bound of 99% CI
meanPulseRate - 2.58*(sdPulseRate/sqrt(numberPulseRate))

## [1] 68.23742

# Calculate upper bound of 99% CI
meanPulseRate + 2.58*(sdPulseRate/sqrt(numberPulseRate))

## [1] 78.70995

Based on these data, we are about 99% confident that the mean pulse rate of all NSCC students is between 68.24 bpm and 78.71 bpm.

Question 6

Describe and explain the difference you observe in your confidence intervals for questions 4 and 5.
The 95% confidence interval is (69.50, 77.45), a width of approximately 8 bpm. The 99% confidence interval is (68.24, 78.71), a width of approximately 10.5 bpm. Raising the confidence level from 95% to 99% raises the critical value from 1.96 to 2.58, which widens the interval: to be more confident that the interval captures the true population mean pulse rate, we have to accept a less precise range.
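The trade-off between confidence and precision can be tabulated for several confidence levels. The sketch below assumes the rounded summary statistics from Question 3; the 90% level is added only for comparison and is not part of the assignment.

```r
se <- 12.51105 / sqrt(38)             # standard error from the Question 3 statistics
conf <- c(0.90, 0.95, 0.99)           # confidence levels to compare
z_crit <- qnorm(1 - (1 - conf) / 2)   # two-sided critical values
width <- 2 * z_crit * se              # full width of each interval

data.frame(conf = conf, z_crit = round(z_crit, 3), width = round(width, 2))
```

The width should grow with the confidence level, since only the critical value changes while the standard error stays fixed.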

Question 7

In the Fall 2009 semester of the 2009-10 academic year, the average NSCC student took 12.1 credits. I’m curious if that average differs among NSCC students last year (a sample of which is in the NSCC student dataset). Conduct a hypothesis test by a confidence interval to determine if the average credits differs last year from Fall 2009.

Write the hypotheses (Try to emulate the “LaTeX” format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0\): \(\mu_{LastYear} = \mu_{Fall2009} =\) 12.1 credits
\(H_A\): \(\mu_{LastYear} \ne\) 12.1 credits

Create confidence interval

# Store mean of Credits variable

meanCredits <- mean(nscc_students$Credits)

# Store standard deviation of Credits variable
sdCredits <- sd(nscc_students$Credits)

# Store sample size of Credits variable
#Since I don't see any NA entries in the Credits column, the sample size is the same as the number of students surveyed (40).
numberCredits <- nrow(nscc_students)

# Lower bound of 95% CI
meanCredits - 1.96*(sdCredits/sqrt(numberCredits))

## [1] 10.73056

# Upper bound of 95% CI
meanCredits + 1.96*(sdCredits/sqrt(numberCredits))

## [1] 12.81944

Make decision to reject \(H_0\) or fail to reject \(H_0\) based on confidence interval

Since the Fall 2009 mean of 12.1 credits falls within the 95% confidence interval of (10.73, 12.82) credits, we fail to reject \(H_0\).

Write a concluding statement
There is not sufficient evidence to support the claim that the mean number of credits taken by NSCC students last year differs from the Fall 2009 average of 12.1 credits. The sample mean was 11.8 credits.
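The confidence-interval decision can be cross-checked with a two-sided z statistic. Because the raw data are not reproduced here, this sketch back-calculates the sample mean and standard error from the interval bounds printed above; that back-calculation is an assumption of the example, not a step in the original analysis.

```r
# 95% CI bounds printed above
lower <- 10.73056
upper <- 12.81944

xbar <- (lower + upper) / 2           # sample mean implied by the interval
se   <- (upper - lower) / (2 * 1.96)  # standard error implied by the interval

# Two-sided z test of H0: mu = 12.1
z <- (xbar - 12.1) / se
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)

c(z = z, p_value = p_two_sided)
```

A z statistic inside the range -1.96 to 1.96 corresponds to a two-sided p-value above 0.05, which is the same decision the interval gives.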

Question 8

NSCC is investigating whether NSCC students have a higher than average stress level which can be identified by a higher than average standing pulse rate. Conduct a hypothesis test by a p-value to determine if NSCC students have a higher pulse rate than the national average of 72 bpm for adults.

Write the hypotheses (Try to emulate the “LaTeX” format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0\): \(\mu_{NSCC} = \mu_{Adult} =\) 72 bpm
\(H_A\): \(\mu_{NSCC} >\) 72 bpm

Calculate p-value of getting sample statistics by chance

# p-value: probability of a sample mean at least as large as ours if the true mean were 72 bpm

pnorm(meanPulseRate, mean = 72, sd = (sdPulseRate/sqrt(numberPulseRate)), lower.tail = FALSE)

## [1] 0.2338856

We calculated a sample mean of 73.47 bpm in Question 3. The p-value is 0.2339: if the true mean pulse rate were 72 bpm, a random sample of 38 students would produce a sample mean of 73.47 bpm or higher about 23% of the time.

Make decision to reject \(H_0\) or fail to reject \(H_0\) at a significance level of 0.05 based on p-value.
Since the p-value of 0.2339 is greater than the significance level of 0.05, we fail to reject \(H_0\).
Write a concluding statement
When standing pulse rates are used as a proxy for stress, there is not sufficient evidence to support the claim that NSCC students have higher than average stress levels.
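The same p-value can be reached through a standardized test statistic, which also makes it easy to compare against a t-based version. This is a sketch using the rounded summary statistics printed in Question 3; the names `mu0`, `p_z`, and `p_t` are illustrative.

```r
# Summary statistics from Question 3 and the null value from H0
xbar <- 73.47368
s    <- 12.51105
n    <- 38
mu0  <- 72

se   <- s / sqrt(n)
stat <- (xbar - mu0) / se             # standardized distance from the null value

p_z <- pnorm(stat, lower.tail = FALSE)            # normal-based, upper tail
p_t <- pt(stat, df = n - 1, lower.tail = FALSE)   # t-based, upper tail

c(statistic = stat, p_z = p_z, p_t = p_t)
```

The normal-based value should match the pnorm result above up to rounding of the inputs, and the t-based value should be slightly larger; neither changes the decision at the 0.05 level. With the raw data loaded, `t.test(PulseRateStudents$PulseRate, mu = 72, alternative = "greater")` should report the t-based result directly.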