Project #4 - Introduction to Statistical Inference

Purpose

In this project, students will demonstrate their understanding of the normal distribution, sampling distributions, confidence intervals, and hypothesis tests.

Question 1

Assume IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. If a person is randomly selected, find each of the requested probabilities using the R chunks below. Here, x denotes the IQ of the randomly selected person.

P(x > 120)

#Find the probability of an IQ greater than 120; lower.tail = FALSE returns the upper tail
pnorm(q = 120, mean = 100, sd = 15, lower.tail = FALSE)

## [1] 0.09121122

P(x < 120)

#Find the probability of an IQ less than 120; the default lower.tail = TRUE already returns the lower tail
pnorm(q = 120, mean = 100, sd = 15)

## [1] 0.9087888
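As a quick sanity check, the two probabilities above are complements for a continuous distribution, so they should add up to 1. A minimal sketch using only the mean and standard deviation given in the question (the object names p_above and p_below are illustrative, not part of the assignment):

```r
# Upper and lower tail probabilities at an IQ of 120
p_above <- pnorm(q = 120, mean = 100, sd = 15, lower.tail = FALSE)
p_below <- pnorm(q = 120, mean = 100, sd = 15)

# For a continuous variable, P(x > 120) + P(x < 120) = 1
p_above + p_below
```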

Question 2

What is the probability that a randomly selected student will have an IQ between 80 and 120? (We haven’t explicitly learned this. Think about it though.)

#Find P(x < 120) and P(x > 80) separately
pnorm(q = 120, mean = 100, sd = 15)

## [1] 0.9087888

pnorm(q = 80, mean = 100, sd = 15, lower.tail = FALSE)

## [1] 0.9087888

Both probabilities are 0.9088 because 80 and 120 are equally far from the mean of 100, but neither one is the probability of falling between them. The probability that a randomly selected student will have an IQ between 80 and 120 is P(x < 120) - P(x < 80) = 0.9088 - 0.0912 = 0.8176, or about 81.76%.
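One way to get the probability of an interval is to subtract the lower-tail probability at the left endpoint from the lower-tail probability at the right endpoint. A minimal sketch under the same normal model (p_between is an illustrative name):

```r
# P(80 < x < 120) = P(x < 120) - P(x < 80)
p_between <- pnorm(q = 120, mean = 100, sd = 15) -
  pnorm(q = 80, mean = 100, sd = 15)
p_between
```

The result is roughly 0.82, noticeably smaller than either one-sided probability.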

Suppose that a random sample of 12 students is selected. What is the probability that their mean IQ is greater than 120?

#Store the values needed for the standard error of the mean
mu <- 100  
sigma <- 15 
n <- 12  
x <- 120  

#Find the standard error of the mean and store it as an object for ease of access 
SEM <- sigma / sqrt(n)

#IQ scores are normal and sigma is known, so the sample mean is normally distributed with standard deviation SEM; use pnorm() rather than pt()
prob <- pnorm(q = x, mean = mu, sd = SEM, lower.tail = FALSE)

#Print the answer rounded to three significant digits
print(signif(prob, 3))

## [1] 1.93e-06
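Because IQ scores are assumed normal with a known standard deviation of 15, the mean of n scores is also normally distributed, with standard deviation 15/sqrt(n). The sketch below assumes that model and uses a hypothetical helper, p_mean_above, to show how quickly the chance of a sample mean above 120 shrinks as the sample size grows:

```r
# Hypothetical helper: P(sample mean > cutoff) for a normal population
# with known mean mu and known standard deviation sigma
p_mean_above <- function(cutoff, mu, sigma, n) {
  sem <- sigma / sqrt(n)  # standard error of the mean
  pnorm(q = cutoff, mean = mu, sd = sem, lower.tail = FALSE)
}

# Compare a single person (n = 1) with samples of 4 and 12 students
sample_sizes <- c(1, 4, 12)
probs <- p_mean_above(cutoff = 120, mu = 100, sigma = 15, n = sample_sizes)
data.frame(n = sample_sizes, prob = probs)
```

With n = 1 this reproduces P(x > 120) from Question 1; larger samples give much smaller probabilities because the standard error is smaller.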

Question 3

Load and store the sample NSCC Student Dataset using the read.csv() function. Find the mean and sample size of the PulseRate variable in this dataset and answer the question that follows below.

# Store the NSCC student dataset in environment
nscc <- read.csv("nscc_student_data.csv")

# Find the mean pulse rate of this sample
mean(nscc$PulseRate, na.rm = TRUE)

## [1] 73.47368

# Find the sample size of pulse rates (hint: it's how many non-NA values there are)
# sum(!is.na(x)) counts the non-missing values, which is the actual sample size
sum(!is.na(nscc$PulseRate))

## [1] 38

Do you think it is likely or unlikely that the population mean pulse rate for all NSCC students is exactly equal to that sample mean found?

I find it unlikely that the population mean pulse rate for all NSCC students is exactly equal to the sample mean found. A sample mean varies from sample to sample, so it is only an estimate of the population mean.

Question 4

If we assume the population standard deviation of pulse rates for all NSCC students is \(\sigma = 14\), construct a 95% confidence interval for the mean pulse rate of all NSCC students and conclude in a complete sentence below.
(Note: we can create a valid confidence interval here since n > 30)

# Store mean
mean <- mean(nscc$PulseRate, na.rm = TRUE)


# Calculate lower bound of 95% CI; both bounds use the form sample mean -/+ 1.96*(sigma/sqrt(n))
mean - 1.96*(14/sqrt(38))

## [1] 69.02233

# Calculate upper bound of 95% CI
mean + 1.96*(14/sqrt(38))

## [1] 77.92504

I am 95% confident that the mean pulse rate for all NSCC students is between 69.02 and 77.93 bpm.
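The same interval can be computed without hard-coding the critical value by asking qnorm() for it. This is a minimal sketch, not the method required by the assignment: z_ci is a hypothetical helper name, and the sample mean and sample size are copied from the output above rather than read from the dataset.

```r
# Hypothetical helper: z-based confidence interval for a mean when sigma is known
z_ci <- function(xbar, sigma, n, conf = 0.95) {
  z_star <- qnorm(1 - (1 - conf) / 2)  # critical value, about 1.96 for 95%
  margin <- z_star * sigma / sqrt(n)   # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Summary values copied from the output above: mean 73.47368, n = 38
ci_95 <- z_ci(xbar = 73.47368, sigma = 14, n = 38, conf = 0.95)
ci_95
```

The bounds differ from the hand calculation only in the later decimals, because qnorm() returns the unrounded critical value instead of 1.96.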

Question 5

Construct a 99% confidence interval for the mean pulse rate of all NSCC students and conclude your result in a complete sentence below the R chunk.

# Calculate lower bound of 99% CI (the critical value for 99% confidence is 2.576)
mean - 2.576*(14/sqrt(38))

## [1] 67.62333

# Calculate upper bound of 99% CI
mean + 2.576*(14/sqrt(38))

## [1] 79.32404

I am 99% confident that the mean pulse rate of all NSCC students is between 67.62 and 79.32 bpm.

Question 6

Describe and explain the difference you observe in your confidence interval results for questions 4 and 5.

The 99% confidence interval is wider than the 95% confidence interval. Being more confident that the interval captures the population mean requires a larger critical value, so the interval extends more standard errors on either side of the sample mean.
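The relationship between confidence level and interval width can be tabulated directly. The sketch below assumes the same \(\sigma = 14\) and n = 38 used above; the object names are illustrative.

```r
# Critical values and interval widths for several confidence levels
conf_levels <- c(0.90, 0.95, 0.99)
z_star <- qnorm(1 - (1 - conf_levels) / 2)  # two-sided critical values
width <- 2 * z_star * 14 / sqrt(38)         # upper bound minus lower bound

data.frame(conf_levels, z_star, width)
```

Each step up in confidence raises the critical value, and the width grows in proportion to it.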

Question 7

In the Fall 2019 semester of the 2019-20 academic year, the average NSCC student took 12.3 credits with \(\sigma = 3.4\). I’m curious if that average differs among NSCC students last year (a sample of which is in the NSCC student dataset). Conduct a hypothesis test by a confidence interval to determine whether last year's average number of credits differs from the Fall 2019 average.

Write the hypotheses (Try to emulate the LaTeX format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0: \mu = 12.3\) \(H_A: \mu \neq 12.3\)

Create confidence interval

# Calculate mean of Credits variable
mean(nscc$Credits)

## [1] 11.775

# Calculate sample size of Credits variable
sum(!is.na(nscc$Credits))

## [1] 40

# Lower bound of 95% CI
11.775 - 1.96*(3.4/sqrt(40))

## [1] 10.72133

# Upper bound of 95% CI
11.775 + 1.96*(3.4/sqrt(40))

## [1] 12.82867

Make decision to reject H0 or fail to reject H0 based on confidence interval

The 95% confidence interval runs from 10.72 to 12.83 and contains the hypothesized mean of 12.3, so we fail to reject the null hypothesis.

Write a concluding statement

There is not sufficient evidence to conclude that the mean number of credits taken by NSCC students last year differs from the Fall 2019 mean of 12.3 credits.
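A two-sided test at the 0.05 level and a 95% confidence interval should always lead to the same decision. The sketch below checks that for this question, using summary values copied from the output above; the object names are illustrative, and the p-value calculation is an addition that the question itself does not ask for.

```r
# Summary values copied from the output above
xbar  <- 11.775  # sample mean of Credits
n     <- 40      # sample size
sigma <- 3.4     # population standard deviation given in the question
mu0   <- 12.3    # Fall 2019 mean under the null hypothesis

# 95% confidence interval and the decision it implies
margin <- 1.96 * sigma / sqrt(n)
ci <- c(lower = xbar - margin, upper = xbar + margin)
reject_by_ci <- mu0 < ci[["lower"]] || mu0 > ci[["upper"]]

# Two-sided p-value from the z statistic
z <- (xbar - mu0) / (sigma / sqrt(n))
p_value <- 2 * pnorm(-abs(z))

# Both rules should agree: reject only if the p-value is below 0.05
c(reject_by_ci = reject_by_ci, reject_by_p = p_value < 0.05)
```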

Question 8

NSCC is investigating whether NSCC students have a higher than average stress level which can be identified by a higher than average standing pulse rate. Conduct a hypothesis test by the p-value method to determine if NSCC students have a higher pulse rate than the national average of 72 bpm for adults. Recall the assumption that \(\sigma = 14\) for NSCC student pulse rates.

Write the hypotheses (You may try to emulate the LaTeX format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0: \mu = 72\) \(H_A: \mu > 72\)

Calculate p-value of getting sample data by chance

# Calculate mean of PulseRate variable, we also store the mean as an object for ease of use in calculating pnorm...
mean_pr <- mean(nscc$PulseRate, na.rm = TRUE)

# Find the sample size n needed for the standard error
#sum(!is.na(x)) counts only the non-NA pulse rates, so n is accurate
sum(!is.na(nscc$PulseRate))

## [1] 38

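#Find the p-value: the probability of a sample mean at least as large as the one observed if the population mean were indeed 72 bpm
pnorm(q = mean_pr, mean = 72, sd = 14/sqrt(38), lower.tail = FALSE)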

## [1] 0.2582061

Make decision to reject H0 or fail to reject H0 at a significance level of 0.05 based on p-value.

Because the p-value of 0.2582 is greater than 0.05, we fail to reject the null hypothesis.

Write a concluding statement

There is not sufficient evidence to support the claim that NSCC students have a higher mean pulse rate than the national average of 72 bpm.
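The same one-sided p-value can also be reached by first standardizing the sample mean into a z statistic. This is a minimal sketch using summary values copied from the output above (sample mean 73.47368, n = 38) and the assumed \(\sigma = 14\); the object names are illustrative.

```r
# Summary values copied from the output above
xbar_pr <- 73.47368  # sample mean pulse rate
n_pr    <- 38        # number of non-NA pulse rates
sigma   <- 14        # assumed population standard deviation
mu0     <- 72        # national average under the null hypothesis

# z statistic: how many standard errors the sample mean sits above 72
z_stat <- (xbar_pr - mu0) / (sigma / sqrt(n_pr))

# Upper-tail p-value for H_A: mu > 72
p_upper <- pnorm(z_stat, lower.tail = FALSE)
c(z = z_stat, p_value = p_upper)
```

The sample mean is less than one standard error above 72, which is why the p-value is well above 0.05.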