Project #4 - Introduction to Statistical Inference

Purpose

In this project, students will demonstrate their understanding of the normal distribution, sampling distributions, confidence intervals, and hypothesis tests.

Question 1

Assume IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. If a person is randomly selected, find each of the requested probabilities. Here, x denotes the IQ of the randomly selected person.

P(x > 120)

#Calculate the probability that a random person has an IQ greater than 120.

pnorm(120,100,15, lower.tail = FALSE)

## [1] 0.09121122

The probability that a random person has an IQ greater than 120 is 9%.

P(x < 120)

#Calculate the probability that a random person has an IQ less than 120.

pnorm(120,100,15)

## [1] 0.9087888

The probability that a random person has an IQ less than 120 is 91%.
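As a quick sanity check on the two results above, the upper-tail and lower-tail probabilities at the same cutoff should sum to 1. The following is a minimal sketch; the variable names `p_upper` and `p_lower` are illustrative and not part of the original assignment.

```r
# Upper-tail and lower-tail probabilities at the same cutoff (IQ = 120)
p_upper <- pnorm(120, mean = 100, sd = 15, lower.tail = FALSE)  # P(x > 120)
p_lower <- pnorm(120, mean = 100, sd = 15)                      # P(x < 120)

# The two tails are complements, so their sum should be 1
p_upper + p_lower
```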

Question 2

What is the probability that a randomly selected student will have an IQ between 80 and 120? (We haven’t explicitly learned this, but think about it.)

#Calculate the probability a random student will have an IQ below 80 (lower tail).

pnorm(80,100,15)

## [1] 0.09121122

#Calculate the probability a random student will have an IQ above 120 (upper tail).

pnorm(120,100,15, lower.tail = FALSE)

## [1] 0.09121122

#Calculate the probability a random student will have an IQ between 80 and 120.

1-0.09*2

## [1] 0.82

Both the lower tail and upper tail represent 9% each and 18% cumulatively. Therefore, the probability a student’s IQ falls between these two tails is 82%.
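The same probability can also be obtained without rounding the tails first, by subtracting the cumulative probability at 80 from the cumulative probability at 120. This is a sketch of that alternative; the name `p_between` is an illustrative choice.

```r
# P(80 < x < 120) = P(x < 120) - P(x < 80) for x ~ Normal(100, 15)
p_between <- pnorm(120, mean = 100, sd = 15) - pnorm(80, mean = 100, sd = 15)

# Unrounded result; it rounds to the 82% reported above
p_between
```

Working from the unrounded tails avoids carrying the rounding of 0.0912 to 0.09 into the final answer.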

Suppose that a random sample of 12 students is selected. What is the probability that their mean IQ is greater than 120?

#Calculate the probability that 12 random students have a mean IQ greater than 120.

pnorm(120,100,(15/sqrt(12)), lower.tail = FALSE)

## [1] 1.929808e-06

The probability that 12 random students have a mean IQ greater than 120 is about 0.0000019 (0.00019%), which is effectively 0.
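To make the reasoning explicit, the calculation can be broken into the standard error of the mean and a z-score. This is a minimal sketch of the same computation; the variable names are illustrative.

```r
# Sampling distribution of the mean for n = 12 draws from Normal(100, 15)
mu    <- 100
sigma <- 15
n     <- 12

se <- sigma / sqrt(n)   # standard error of the sample mean
z  <- (120 - mu) / se   # how many standard errors 120 lies above the mean

# Upper-tail probability for that z-score
p_mean_above_120 <- pnorm(z, lower.tail = FALSE)
p_mean_above_120
```

Because the standard error (about 4.33) is much smaller than the population standard deviation of 15, a sample mean of 120 lies more than 4.6 standard errors above 100, which is why the probability is so small.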

Question 3

Load and store the sample NSCC Student Dataset using the read.csv() function. Find the mean and sample size of the PulseRate variable in this dataset and answer the question that follows below.

# Store the NSCC student dataset in environment.

nscc_student_data <- read.csv("C:/Users/jperry23/Downloads/nscc_student_data.csv")
View(nscc_student_data)

# Find the mean pulse rate of this sample.

mean(nscc_student_data$PulseRate,na.rm = TRUE)

## [1] 73.47368

# Find the sample size of pulse rates (hint: it's how many non-NA values there are).

table(is.na(nscc_student_data$PulseRate))

## 
## FALSE  TRUE 
##    38     2

The mean pulse rate of the sampled NSCC students is about 73 bpm. The sample size of pulse rates is 38, excluding the 2 NA values.
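The count of non-missing values can also be computed directly as a single number, which is convenient when it needs to be reused later. The sketch below uses a small made-up vector, `pulse`, as a stand-in for the PulseRate column; the values are hypothetical and not taken from the dataset.

```r
# Hypothetical stand-in for nscc_student_data$PulseRate (values are made up)
pulse <- c(72, NA, 80, 65, NA, 90)

# is.na() flags missing entries; negating and summing counts the non-missing ones
n_pulse <- sum(!is.na(pulse))
n_pulse

# The mean must skip the missing values explicitly
mean(pulse, na.rm = TRUE)
```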

Do you think it is likely or unlikely that the population mean pulse rate for all NSCC students is exactly equal to that sample mean found?

It is unlikely that the population mean pulse rate is exactly equal to the sample mean. The sample mean is based on only 38 students, a small fraction of all students attending NSCC, so it will vary from sample to sample and is best treated as an estimate of the population mean.

Question 4

If we assume the population standard deviation of pulse rates for all NSCC students is \(\sigma = 14\), construct a 95% confidence interval for the mean pulse rate of all NSCC students and conclude in a complete sentence below.
(Note: we can create a valid confidence interval here since n > 30)

# Store the mean of NSCC student pulse rates.

meanPulseRate<-mean(nscc_student_data$PulseRate,na.rm = TRUE)

# Store the assumed population standard deviation of NSCC student pulse rates.

sdPulseRate<-14

# Calculate lower bound of 95% CI.

meanPulseRate - 1.96*(sdPulseRate/sqrt(38))

## [1] 69.02233

# Calculate upper bound of 95% CI.

meanPulseRate + 1.96*(sdPulseRate/sqrt(38))

## [1] 77.92504

I am 95% confident that the mean pulse rate of all NSCC students falls between 69.0 bpm and 77.9 bpm.
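The two bound calculations can be wrapped in one small helper that looks up the critical value with `qnorm()` instead of hard-coding 1.96. This is a sketch under the same known-sigma assumption; the function name `z_interval` and its arguments are illustrative, and the inputs are the sample mean, sigma, and sample size reported above.

```r
# Confidence interval for a mean when the population SD (sigma) is assumed known
z_interval <- function(xbar, sigma, n, conf_level = 0.95) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  margin <- z_star * sigma / sqrt(n)         # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Sample mean 73.47368, sigma = 14, n = 38
ci95 <- z_interval(73.47368, 14, 38)
ci95
```

The bounds differ from the hand calculation only in later decimal places, because `qnorm(0.975)` is slightly smaller than the rounded value 1.96.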

Question 5

Construct a 99% confidence interval for the mean pulse rate of all NSCC students and conclude your result in a complete sentence below the R chunk.

# Calculate lower bound of 99% CI.

meanPulseRate - 2.58*(sdPulseRate/sqrt(38))

## [1] 67.61425

# Calculate upper bound of 99% CI.

meanPulseRate + 2.58*(sdPulseRate/sqrt(38))

## [1] 79.33312

I am 99% confident that the mean pulse rate of all NSCC students falls between 67.6 bpm and 79.3 bpm.

Question 6

Describe and explain the difference you observe in your confidence intervals for questions 4 and 5.

If I want to be 95% confident that my interval captures the population mean pulse rate, I can use the narrower interval (69.0 to 77.9 bpm); if I want to be 99% (more) confident, I need a slightly wider interval (67.6 to 79.3 bpm). Raising the confidence level increases the critical value from 1.96 to 2.58, which widens the margin of error, although the two intervals differ by only about 1.4 bpm on each side.
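The relationship between confidence level and interval width can be seen by computing the margin of error at several levels. This is a sketch using the assumed sigma of 14 and the sample size of 38; the 90% level is added only for comparison and does not appear in the assignment.

```r
sigma <- 14
n     <- 38

conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided critical values and the resulting margins of error
z_star <- qnorm(1 - (1 - conf_levels) / 2)
margin <- z_star * sigma / sqrt(n)

# Higher confidence -> larger critical value -> wider interval
data.frame(conf_levels, z_star, margin)
```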

Question 7

In the Fall 2009 semester of the 2009-10 academic year, the average NSCC student took 12.3 credits with \(\sigma = 3.4\). I’m curious if that average differs among NSCC students last year (a sample of which is in the NSCC student dataset). Conduct a hypothesis test by a confidence interval to determine if the average credits differs last year from Fall 2009.

Write the hypotheses (Try to emulate the “latex” format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0: \mu = 12.3\:credits\)

\(H_A: \mu \neq 12.3\:credits\)

Create confidence interval

# Calculate mean of Credits variable.

mean(nscc_student_data$Credits)

## [1] 11.775

#Store mean of Credits variable.

meanCredits<-mean(nscc_student_data$Credits)

#Calculate and store the sample standard deviation of the Credits variable (used here in place of the stated sigma = 3.4).

sdCredits<-sd(nscc_student_data$Credits)

# Calculate sample size of Credits variable.

table(is.na(nscc_student_data$Credits))

## 
## FALSE 
##    40

The sample mean of credits taken by NSCC students last year is about 11.8. This is based on a sample of 40 students.

# Lower bound of 95% CI

meanCredits - 1.96*(sdCredits/sqrt(40))

## [1] 10.73056

# Upper bound of 95% CI

meanCredits + 1.96*(sdCredits/sqrt(40))

## [1] 12.81944

I am 95% confident that the mean number of credits taken by all NSCC students last year falls between 10.7 and 12.8.

Make decision to reject H0 or fail to reject H0 based on confidence interval

We fail to reject H0.

Write a concluding statement

Under confidence-interval-based hypothesis testing, we reject the null hypothesis only if the hypothesized value falls outside the bounds of the confidence interval. In this case, we fail to reject H0 because 12.3 falls within the interval: 10.7 < 12.3 < 12.8. There is not sufficient evidence that the average number of credits taken last year differs from the Fall 2009 average of 12.3.
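The decision rule can be written as a simple check of whether the hypothesized mean lies outside the interval. This is a sketch that reuses the bounds printed above; the names `ci`, `mu0`, and `reject_h0` are illustrative.

```r
# 95% CI bounds reported above, and the hypothesized mean under H0
ci  <- c(lower = 10.73056, upper = 12.81944)
mu0 <- 12.3

# Reject H0 only if mu0 falls outside the interval
reject_h0 <- mu0 < ci[["lower"]] || mu0 > ci[["upper"]]
reject_h0
```

Here `reject_h0` is `FALSE`, matching the decision to fail to reject H0.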

Question 8

NSCC is investigating whether NSCC students have a higher than average stress level which can be identified by a higher than average standing pulse rate. Conduct a hypothesis test by a p-value to determine if NSCC students have a higher pulse rate than the national average of 72 bpm for adults. Recall the assumption that \(\sigma = 14\) for NSCC student pulse rates.

Write the hypotheses (Try to emulate the “latex” format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0: \mu \le 72\:bpm\)

\(H_A: \mu > 72\:bpm\)

Calculate p-value of getting sample data by chance

# Calculate mean of PulseRate variable.

mean(nscc_student_data$PulseRate, na.rm = TRUE)

## [1] 73.47368

The mean of the sampled NSCC students' pulse rates is about 73.5 bpm.

# Probability of getting that sample data by random chance if pop mean was indeed 72bpm.

pnorm(73.47,72,sdPulseRate/sqrt(38), lower.tail = FALSE)

## [1] 0.2587307

The probability of observing a sample mean of at least 73.47 bpm, if the population mean is 72 bpm, is about 26%.
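The same p-value can be reached by first standardizing the sample mean into a z test statistic. This is a sketch of that equivalent route, using the rounded sample mean of 73.47 from the chunk above and the assumed sigma of 14; the variable names are illustrative.

```r
xbar  <- 73.47  # rounded sample mean used in the chunk above
mu0   <- 72     # hypothesized population mean
sigma <- 14     # assumed population standard deviation
n     <- 38     # number of non-missing pulse rates

# z test statistic: distance of the sample mean from mu0 in standard errors
z <- (xbar - mu0) / (sigma / sqrt(n))

# One-sided (upper-tail) p-value, matching HA: mu > 72
p_value <- pnorm(z, lower.tail = FALSE)
p_value
```

The sample mean is only about 0.65 standard errors above 72, which is why the p-value is large.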

Make decision to reject H0 or fail to reject H0 at a significance level of 0.05 based on p-value.