Project #4 - Introduction to Statistical Inference

Purpose

In this project, students will demonstrate their understanding of the normal distribution, sampling distributions, confidence intervals, and hypothesis tests.

Question 1

Assume IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. If a person is randomly selected, find each of the requested probabilities. Here, x denotes the IQ of the randomly selected person.

P(x > 120)

# Using the pnorm function to find the probability of an IQ of 120 or higher (normally distributed data):

pnorm(120, mean=100, sd=15, lower.tail=FALSE)

## [1] 0.09121122

The probability of a randomly selected person having an IQ of 120 or higher is 0.0912.

P(x < 120)

# Using the pnorm function to find the probability of an IQ of 120 or lower:

pnorm(120, mean=100, sd=15)

## [1] 0.9087888

The probability of a randomly selected person having an IQ of 120 or lower is 0.909.
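As a quick check on the two answers above, the upper-tail and lower-tail probabilities at the same cutoff should sum to 1, because a continuous distribution assigns zero probability to the single value 120. A minimal sketch of that check follows; the variable names `p_upper` and `p_lower` are illustrative and not part of the assignment.

```r
# Upper- and lower-tail probabilities at the same cutoff (IQ = 120)
p_upper <- pnorm(120, mean = 100, sd = 15, lower.tail = FALSE)
p_lower <- pnorm(120, mean = 100, sd = 15)

# The two tails are complements, so their sum should be 1
p_upper + p_lower
```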

Question 2

What is the probability that a randomly selected student will have an IQ between 80 and 120? (We haven’t explicitly learned this. Think about it though.)

# Using the pnorm function to find the probability of an IQ of 80 or lower, and subtracting that number from 0.909 (the probability of an IQ of 120 or lower):

0.909 - pnorm(80, mean = 100, sd = 15)

## [1] 0.8177888

The probability of a randomly selected person having an IQ between 80 and 120 is 0.818.
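The same interval probability can be computed without typing in a rounded intermediate value, since P(80 < x < 120) is the difference of the two lower-tail probabilities. The sketch below assumes the same mean of 100 and standard deviation of 15; the names `iq_mean`, `iq_sd`, and `p_between` are illustrative only.

```r
# Parameters of the IQ distribution given in Question 1
iq_mean <- 100
iq_sd <- 15

# P(80 < x < 120) = P(x < 120) - P(x < 80)
p_between <- pnorm(120, mean = iq_mean, sd = iq_sd) -
  pnorm(80, mean = iq_mean, sd = iq_sd)

# Rounded to three decimal places for reporting
round(p_between, 3)
```

Because 80 and 120 are equally far from the mean, the result also equals 1 minus twice the upper-tail probability found in Question 1.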

Suppose that a random sample of 12 students is selected. What is the probability that their mean IQ is greater than 120?

To calculate this probability, I will need to find the standard error using the formula \(SE = \sigma/\sqrt{n}\). I will then use the pnorm function as I did in the previous questions, with the standard error in place of the standard deviation.

pnorm(120, mean=100, sd=(15/sqrt(12)), lower.tail = FALSE)

## [1] 1.929808e-06

The probability of a random sample of 12 students having a mean IQ score of 120 or higher is 0.00000193.
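To make the standard error step explicit, the calculation can be broken into pieces: compute the standard error, convert the sample mean of 120 to a z-score, and then look up the upper tail of the standard normal distribution. This is only a sketch of the same calculation; the variable names are illustrative.

```r
# Population standard deviation and sample size from the question
sigma <- 15
n <- 12

# Standard error of the sample mean: sigma / sqrt(n)
se <- sigma / sqrt(n)

# z-score of a sample mean of 120 when the population mean is 100
z <- (120 - 100) / se

# Upper-tail probability on the standard normal scale
p_mean_above_120 <- pnorm(z, lower.tail = FALSE)
p_mean_above_120
```

The standard error is about 4.33, so a sample mean of 120 sits more than four and a half standard errors above 100, which is why the probability is so small.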

Question 3

Load and store the sample NSCC Student Dataset using the read.csv() function. Find the mean and sample size of the PulseRate variable in this dataset and answer the question that follows below.

# Store the NSCC student dataset in environment

nscc_student_data <- read.csv("C:/Users/jessi/Music/Statistics/nscc_student_data.csv")

# Find the mean pulse rate of this sample

mean(nscc_student_data$PulseRate, na.rm = TRUE)

## [1] 73.47368

# Find the sample size of pulse rates (hint: it's how many non-NA values there are)

table(is.na(nscc_student_data$PulseRate))

## 
## FALSE  TRUE 
##    38     2

The mean pulse rate of the 38 sampled students is 73.47.
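An alternative to reading the count off a table is to sum the non-missing values directly, which returns the sample size as a single number that can be stored and reused. The sketch below uses a small made-up vector, `pulse`, rather than the NSCC data, so the values are hypothetical.

```r
# Hypothetical pulse rates with two missing values (not the NSCC data)
pulse <- c(72, 80, NA, 65, 90, NA, 77)

# is.na() flags missing values; negating and summing counts the non-missing ones
n_valid <- sum(!is.na(pulse))

# Mean of the non-missing values only
mean_valid <- mean(pulse, na.rm = TRUE)

n_valid
mean_valid
```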

Do you think it is likely or unlikely that the population mean pulse rate for all NSCC students is exactly equal to that sample mean found?

It is extremely unlikely that the population mean of NSCC students’ pulse rates is the exact same as the sample mean.

Question 4

If we assume the population standard deviation of pulse rates for all NSCC students is \(\sigma = 14\), construct a 95% confidence interval for the mean pulse rate of all NSCC students and conclude in a complete sentence below.
(Note: we can create a valid confidence interval here since n > 30)

# Store mean

mean_pulse_rate <- mean(nscc_student_data$PulseRate, na.rm=TRUE)

# Calculate lower bound of 95% CI

mean_pulse_rate-1.96*(14/sqrt(38))

## [1] 69.02233

# Calculate upper bound of 95% CI

mean_pulse_rate+1.96*(14/sqrt(38))

## [1] 77.92504

I am 95% confident that the mean of all NSCC students’ pulse rates is between 69.02 and 77.93.
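The interval above has the form sample mean ± critical value × standard error. As a sketch, that formula can be wrapped in a small helper that looks up the critical value with qnorm instead of typing 1.96. The function name `z_ci` and its arguments are hypothetical, and the sample mean is entered by hand from the output in Question 3.

```r
# Hypothetical helper: z-based confidence interval for a mean with known sigma
z_ci <- function(xbar, sigma, n, conf_level = 0.95) {
  # Critical value that leaves (1 - conf_level) / 2 in each tail
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  # Margin of error: critical value times standard error
  margin <- z_star * sigma / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin)
}

# Sample mean, assumed sigma, and sample size from Questions 3 and 4
ci_95 <- z_ci(73.47368, sigma = 14, n = 38)
ci_95
```

Because qnorm(0.975) is approximately 1.959964 rather than exactly 1.96, the bounds differ from the hand calculation by less than 0.001.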

Question 5

Construct a 99% confidence interval for the mean pulse rate of all NSCC students and conclude your result in a complete sentence below the R chunk.

# Calculate lower bound of 99% CI

mean_pulse_rate-2.58*(14/sqrt(38))

## [1] 67.61425

# Calculate upper bound of 99% CI

mean_pulse_rate+2.58*(14/sqrt(38))

## [1] 79.33312

I am 99% confident that the mean of all NSCC students’ pulse rates is between 67.61 and 79.33.

Question 6

Describe and explain the difference you observe in your confidence intervals for questions 4 and 5.

The 99% confidence interval is wider than the 95% confidence interval. For the same sample this is always the case, because a higher confidence level requires a larger critical value (2.58 instead of 1.96) and therefore a larger margin of error. This is logical: a wider interval is more likely to capture the true population mean.
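The relationship between confidence level and interval width can be illustrated by computing the width at several confidence levels with the same standard error. This is a sketch under the assumptions already used above (\(\sigma = 14\), n = 38); the 90% level is added only for comparison and is not part of the assignment.

```r
# Standard error from the assumed sigma and the sample size
se <- 14 / sqrt(38)

# Confidence levels to compare (0.90 is included only for illustration)
conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided critical values and the resulting interval widths
z_star <- qnorm(1 - (1 - conf_levels) / 2)
widths <- 2 * z_star * se

data.frame(conf_level = conf_levels,
           z_star = round(z_star, 3),
           width = round(widths, 2))
```

The width grows with the confidence level because only the critical value changes; the sample mean and the standard error stay the same.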

Question 7

In the Fall 2009 semester of the 2009-10 academic year, the average NSCC student took 12.3 credits with \(\sigma = 3.4\). I’m curious whether that average was different for NSCC students last year (a sample of which is in the NSCC student dataset). Conduct a hypothesis test by a confidence interval to determine whether the average number of credits last year differs from Fall 2009.

Write the hypotheses (Try to emulate the “latex” format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0: \mu_{2018} = 12.3\)

\(H_A: \mu_{2018} \neq 12.3\)

Create confidence interval

# Calculate mean of Credits variable

mean(nscc_student_data$Credits)

## [1] 11.775

# Calculate sample size of Credits variable

table(is.na(nscc_student_data$Credits))

## 
## FALSE 
##    40

Of the 40 students sampled, the mean number of credits taken is 11.8.

# Lower bound of 95% CI

11.8-1.96*(3.4/sqrt(40))

## [1] 10.74633

# Upper bound of 95% CI

11.8+1.96*(3.4/sqrt(40))

## [1] 12.85367

Make decision to reject H0 or fail to reject H0 based on confidence interval

We are 95% confident that the 2018 NSCC students’ credits mean is between 10.7 and 12.9. Because the 2009 mean of 12.3 is within the bounds of our confidence interval, we therefore fail to reject the null hypothesis (\(H_0\)).

Write a concluding statement

There is not sufficient evidence to say that the mean number of credits taken by NSCC students in 2018 is any different from that of NSCC students in 2009.
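The decision rule used here (reject \(H_0\) only if the hypothesized mean falls outside the confidence interval) can be written out as a short check. This sketch reuses the rounded sample mean of 11.8 and the critical value 1.96 from the calculation above; the names `ci`, `mu_0`, and `reject_h0` are illustrative.

```r
# 95% confidence interval for mean credits, using the rounded sample mean
ci <- 11.8 + c(-1, 1) * 1.96 * (3.4 / sqrt(40))

# Hypothesized mean under H0 (the Fall 2009 average)
mu_0 <- 12.3

# Reject H0 only when mu_0 lies outside the interval
reject_h0 <- mu_0 < ci[1] || mu_0 > ci[2]
reject_h0
```

The check returns FALSE because 12.3 lies between the two bounds, which matches the decision to fail to reject \(H_0\).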

Question 8

NSCC is investigating whether NSCC students have a higher than average stress level which can be identified by a higher than average standing pulse rate. Conduct a hypothesis test by a p-value to determine if NSCC students have a higher pulse rate than the national average of 72 bpm for adults. Recall the assumption that \(\sigma = 14\) for NSCC student pulse rates.

Write the hypotheses (Try to emulate the “latex” format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0: \mu_{PR} = 72\)

\(H_A: \mu_{PR} > 72\)

Calculate p-value of getting sample data by chance

# Calculate mean of PulseRate variable

mean(nscc_student_data$PulseRate, na.rm = TRUE)

## [1] 73.47368

# Probability of getting that sample data by random chance if pop mean was indeed 72bpm

pnorm(73.47, mean=72, sd=(14/sqrt(38)), lower.tail=FALSE)

## [1] 0.2587307

The p-value is 0.259: if the population mean were truly 72 bpm, the probability of observing a sample mean of 73.47 or higher by random chance would be 0.259.

Make decision to reject H0 or fail to reject H0 at a significance level of 0.05 based on p-value.

Since p > 0.05, we fail to reject \(H_0\).

Write a concluding statement

There is not sufficient evidence to say that NSCC students have pulse rates (and therefore stress levels) that are any higher than the national average.
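The same one-sided test can be expressed through a z test statistic, which makes it easier to see how far the sample mean lies from the hypothesized mean in standard-error units. This is a sketch that uses the rounded sample mean of 73.47 from the calculation above; the variable names are illustrative.

```r
# Sample mean (rounded), hypothesized mean, and standard error
xbar <- 73.47
mu_0 <- 72
se <- 14 / sqrt(38)

# Test statistic: number of standard errors the sample mean is above mu_0
z <- (xbar - mu_0) / se

# One-sided (upper-tail) p-value for H_A: mu > 72
p_value <- pnorm(z, lower.tail = FALSE)

# Compare the p-value with the 0.05 significance level
p_value < 0.05
```

The sample mean is well under one standard error above 72, so the comparison returns FALSE and the decision to fail to reject \(H_0\) follows.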