Project #4 - Introduction to Statistical Inference

Purpose

In this project, students will demonstrate their understanding of the normal distribution, sampling distributions, confidence intervals, and hypothesis tests.

Question 1

Assume IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. If a person is randomly selected, find each of the requested probabilities. Here, x denotes the IQ of the randomly selected person.

P(x > 120)

# This code finds the probability that a randomly selected person has an IQ greater than 120
pnorm(120, 100, 15, lower.tail = FALSE)

## [1] 0.09121122

The probability of getting an IQ greater than 120 is 0.0912.

P(x < 120)

# This code finds the probability that a randomly selected person has an IQ less than 120
pnorm(120, 100, 15)

## [1] 0.9087888

The probability of a randomly selected person getting an IQ less than 120 is 0.9088.
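
As a cross-check on the two answers above, the same probabilities can be obtained by standardizing first. The sketch below is illustrative only: the names `z`, `p_greater`, and `p_less` are not part of the assignment. It converts 120 to a z-score and confirms that the two tail areas are complements.

```r
# Standardize x = 120 under a normal distribution with mean 100 and sd 15
z <- (120 - 100) / 15

# Upper and lower tail areas from the standard normal distribution
p_greater <- pnorm(z, lower.tail = FALSE)
p_less <- pnorm(z)

# These match pnorm(120, 100, 15, ...) computed on the original scale
round(c(p_greater, p_less), 4)

# The two tails are complements, so they sum to 1
p_greater + p_less
```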

Question 2

What is the probability that a randomly selected student will have an IQ between 80 and 120? (We haven’t explicitly learned this. Think about it though.)

# This code finds the probability that a randomly selected student has an IQ between 80 and 120
# (area below the upper bound minus area below the lower bound)
middle.two.tail.n <- function(upper, lower, m, sd){pnorm(upper, m, sd) - pnorm(lower, m, sd)}

middle.two.tail.n(120, 80, 100, 15)

## [1] 0.8175776

The probability of a randomly selected student having an IQ between 80 and 120 is 0.8176.
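
Because 80 and 120 are the same distance from the mean of 100, the interval probability can also be checked by symmetry. The following is a minimal sketch (the names `p_between` and `p_symmetry` are illustrative) that compares the difference of two cumulative probabilities with one minus twice the upper tail.

```r
# P(80 < x < 120) as a difference of two cumulative probabilities
p_between <- pnorm(120, mean = 100, sd = 15) - pnorm(80, mean = 100, sd = 15)

# Symmetry check: remove the two equal tails below 80 and above 120
p_symmetry <- 1 - 2 * pnorm(120, mean = 100, sd = 15, lower.tail = FALSE)

# The two approaches should agree up to floating-point error
all.equal(p_between, p_symmetry)
```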

Suppose that a random sample of 12 students is selected. What is the probability that their mean IQ is greater than 120?

# This code finds the probability that a random sample of 12 students has a mean IQ greater than 120
# (the sampling distribution of the mean has standard deviation 15/sqrt(12))
pnorm(120, 100,15/sqrt(12), lower.tail = FALSE)

## [1] 1.929808e-06

The probability that a random sample of 12 students has a mean IQ greater than 120 is 1.929808e-06 (about 0.0002%), which is essentially zero.
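
The reason this probability is so much smaller than for a single person is that sample means vary less than individual scores. A short sketch of the intermediate steps, with illustrative variable names, makes the standard error and z-score explicit:

```r
sigma <- 15   # population standard deviation of IQ scores
n <- 12       # sample size

# Standard error of the sample mean
se <- sigma / sqrt(n)
se

# z-score of a sample mean of 120 under the sampling distribution
z <- (120 - 100) / se
z

# Upper-tail probability; matches pnorm(120, 100, 15/sqrt(12), lower.tail = FALSE)
p_mean_greater <- pnorm(z, lower.tail = FALSE)
p_mean_greater
```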

Question 3

Load and store the sample NSCC Student Dataset using the read.csv() function. Find the mean and sample size of the PulseRate variable in this dataset and answer the question that follows below.

# Load the NSCC student dataset and store it as an object in the environment
library(readxl)
nscc_student_data <- read_excel("~/Desktop/Stats/nscc_student_data.xlsx")

# Find the mean pulse rate of this sample
mean(nscc_student_data$PulseRate,na.rm = TRUE)

## [1] 73.47368

# Find the sample size of pulse rates (hint: it's how many non-NA values there are)
sum(!is.na(nscc_student_data$PulseRate))

## [1] 38

table(is.na(nscc_student_data$PulseRate))

## 
## FALSE  TRUE 
##    38     2

The mean of the Pulse Rate variable is 73.4737 and the sample size is 38.
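
When a variable contains missing values, the sample size is the number of observed entries rather than the number of rows. A small sketch with a made-up vector (hypothetical values, not the NSCC data) shows the difference:

```r
# Hypothetical pulse-rate vector with two missing values (not the NSCC data)
pulse <- c(80, 74, NA, 60, 50, NA, 85)

# is.na() flags missing entries; negating and summing counts the observed ones
n_observed <- sum(!is.na(pulse))
n_observed

# length() counts every entry, including NAs, so it overstates the sample size
length(pulse)
```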

Do you think it is likely or unlikely that the population mean pulse rate for all NSCC students is exactly equal to that sample mean found?

It is unlikely that the population mean pulse rate for all NSCC students is exactly equal to the sample mean, but the sample mean is probably close to the population mean.

Question 4

If we assume the population standard deviation of pulse rates for all NSCC students is \(\sigma = 14\), construct a 95% confidence interval for the mean pulse rate of all NSCC students and conclude in a complete sentence below.
(Note: we can create a valid confidence interval here since n > 30)

# Store mean
mean_nscc <- mean(nscc_student_data$PulseRate, na.rm = TRUE)

# Calculate lower bound of 95% CI (n = 38 non-NA pulse rates)
mean_nscc-1.96*(14/sqrt(38))

## [1] 69.02233

# Calculate upper bound of 95% CI
mean_nscc+1.96*(14/sqrt(38))

## [1] 77.92504

I am 95% confident that the mean pulse rate for all NSCC students is between 69.02 and 77.93.
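
The interval above follows the pattern "sample mean plus or minus critical value times standard error". As a rough sketch, that pattern can be wrapped in a small helper. The function `z_ci` and its arguments are hypothetical names introduced here for illustration, and it uses `qnorm()` for the critical value instead of the rounded 1.96.

```r
# Hypothetical helper: z-based confidence interval when sigma is known
z_ci <- function(xbar, sigma, n, conf = 0.95) {
  z_star <- qnorm(1 - (1 - conf) / 2)   # critical value, about 1.96 for 95%
  margin <- z_star * sigma / sqrt(n)    # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Example with made-up numbers (not the NSCC results)
z_ci(xbar = 50, sigma = 10, n = 25)
```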

Question 5

Construct a 99% confidence interval for the mean pulse rate of all NSCC students and conclude your result in a complete sentence below the R chunk.

# Calculate lower bound of 99% CI (n = 38 non-NA pulse rates)
mean_nscc-2.58*(14/sqrt(38))

## [1] 67.61425

# Calculate upper bound of 99% CI
mean_nscc+2.58*(14/sqrt(38))

## [1] 79.33312

I am 99% confident that the population mean for the pulse rate of all NSCC students is between 67.61 and 79.33.

Question 6

Describe and explain the difference you observe in your confidence intervals for questions 4 and 5.

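The 99% confidence interval is wider than the 95% confidence interval. A higher confidence level requires a larger critical value (2.58 instead of 1.96), which increases the margin of error while the sample mean and standard error stay the same.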
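
The effect of the confidence level on interval width comes entirely from the critical value. A brief sketch (variable names are illustrative) computes the critical values for three common confidence levels:

```r
# Common confidence levels
conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided critical values from the standard normal distribution
z_star <- qnorm(1 - (1 - conf_levels) / 2)
round(z_star, 3)
```

The critical values grow with the confidence level, so the margin of error and the interval width grow with it.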

Question 7

In the Fall 2009 semester of the 2009-10 academic year, the average NSCC student took 12.3 credits with \(\sigma = 3.4\). I’m curious if that average differs among NSCC students last year (a sample of which is in the NSCC student dataset). Conduct a hypothesis test by a confidence interval to determine if the average credits differs last year from Fall 2009.

Write the hypotheses (Try to emulate the “latex” format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0: \mu = 12.3\)
\(H_A: \mu \neq 12.3\)

Here \(\mu\) is the mean number of credits taken by NSCC students last year.

Create confidence interval

# Calculate mean of Credits variable
#This code will find the mean of credits variable
mean(nscc_student_data$Credits)

## [1] 11.775

# Calculate sample size of Credits variable (number of non-NA values)
sum(!is.na(nscc_student_data$Credits))

## [1] 40

# Lower bound of 95% CI (uses the unrounded sample mean)
mean(nscc_student_data$Credits) - 1.96*(3.4/sqrt(40))

## [1] 10.72133

# Upper bound of 95% CI
mean(nscc_student_data$Credits) + 1.96*(3.4/sqrt(40))

## [1] 12.82867

The 95% confidence interval is between 10.72 and 12.83.

Make decision to reject H0 or fail to reject H0 based on confidence interval

We fail to reject H0 because the hypothesized value of 12.3 lies within the 95% confidence interval of 10.72 to 12.83.
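
The decision rule used here (reject H0 only when the hypothesized value falls outside the interval) can be sketched as a small function. The name `ci_decision` and its arguments are hypothetical, and the sketch assumes a known \(\sigma\) and a two-sided alternative.

```r
# Hypothetical helper: two-sided test of H0: mu = mu0 via a z confidence interval
ci_decision <- function(xbar, sigma, n, mu0, conf = 0.95) {
  margin <- qnorm(1 - (1 - conf) / 2) * sigma / sqrt(n)
  inside <- (mu0 >= xbar - margin) && (mu0 <= xbar + margin)
  if (inside) "fail to reject H0" else "reject H0"
}

# Values from this question: sample mean 11.775, sigma 3.4, n 40, null value 12.3
ci_decision(xbar = 11.775, sigma = 3.4, n = 40, mu0 = 12.3)
```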

Write a concluding statement

There is not sufficient evidence to conclude that the average number of credits taken by NSCC students last year differs from the Fall 2009 average of 12.3.

Question 8

NSCC is investigating whether NSCC students have a higher than average stress level which can be identified by a higher than average standing pulse rate. Conduct a hypothesis test by a p-value to determine if NSCC students have a higher pulse rate than the national average of 72 bpm for adults. Recall the assumption that \(\sigma = 14\) for NSCC student pulse rates.

Write the hypotheses (Try to emulate the “latex” format I used in lecture notes. Otherwise, just give your best effort.)

\(H_0: \mu = 72\)
\(H_A: \mu > 72\)

Calculate p-value of getting sample data by chance

# Calculate mean of PulseRate variable
mean(nscc_student_data$PulseRate, na.rm = TRUE)

## [1] 73.47368

# This code will find the sample size of the PulseRate variable (number of non-NA values)
sum(!is.na(nscc_student_data$PulseRate))

## [1] 38

# Probability of getting a sample mean at least this large by random chance if pop mean was indeed 72bpm
# (upper tail, because the alternative hypothesis is mu > 72)
pnorm(73.47368, mean = 72, sd = 14/sqrt(38), lower.tail = FALSE)

## [1] 0.2582067

The p-value of getting the sample data by chance is 0.2582, therefore p > 0.05.

Make decision to reject H0 or fail to reject H0 at a significance level of 0.05 based on p-value.

Since p > 0.05, we fail to reject the null hypothesis.

Write a concluding statement

Based on the data, there is not sufficient evidence that NSCC students have a higher than average pulse rate, so the data do not support the claim that they have a higher than average stress level.
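
For an upper-tailed alternative with known \(\sigma\), the p-value is the area of the sampling distribution above the observed sample mean. The sketch below wraps that calculation in a function; `z_test_upper` and its arguments are hypothetical names, not something from the assignment or a package.

```r
# Hypothetical helper: p-value for H0: mu = mu0 vs HA: mu > mu0 with known sigma
z_test_upper <- function(xbar, mu0, sigma, n) {
  z <- (xbar - mu0) / (sigma / sqrt(n))   # test statistic
  pnorm(z, lower.tail = FALSE)            # area above the observed z
}

# Values from this question: sample mean 73.47368, null value 72, sigma 14, n 38
p_value <- z_test_upper(xbar = 73.47368, mu0 = 72, sigma = 14, n = 38)
p_value

# Compare against the 0.05 significance level
p_value < 0.05
```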