Before we get started with the exercise questions, here are functions and equations that we will use later.
\[z = \frac{x - \mu}{\sigma}\]
## Ex: Compute the z-score of 3.4, given a population mean of 4.8 and standard deviation of 1.3.
z_score <- function(x, mean, sd){
  (x - mean)/sd
}
z_score(x=3.4, mean=4.8, sd=1.3)
## [1] -1.076923
normal_area <- function(mean = 0, sd = 1, lb, ub, acolor = "lightblue", ...) {
  x_axis <- seq(mean - 3.3 * sd, mean + 3.3 * sd, length = 100)
  if (missing(lb) && missing(ub)) {
    lb_fill <- min(x_axis)
    ub_fill <- max(x_axis)
    title <- paste("100%")
    caption <- ""
  }
  else if (missing(lb)) {
    lb_fill <- min(x_axis)
    ub_fill <- ub
    area <- round(pnorm(ub, mean, sd)*100, 2)
    ubz <- paste(round(z_score(ub, mean, sd), 3))
    title <- paste(area, "%\n", "x < ", ub, sep = "")
    caption <- paste("z <", ubz)
  }
  else if (missing(ub)) {
    ub_fill <- max(x_axis)
    lb_fill <- lb
    area <- round(pnorm(lb, mean, sd, lower.tail = FALSE)*100, 2)
    lbz <- paste(round(z_score(lb, mean, sd), 3))
    title <- paste(area, "%\n", "x > ", lb, sep = "")
    caption <- paste("z >", lbz)
  }
  else if (lb <= ub) {
    ub_fill <- ub
    lb_fill <- lb
    ubn <- pnorm(ub, mean, sd)
    lbn <- pnorm(lb, mean, sd)
    area <- round((ubn - lbn)*100, 2)
    ubz <- paste(round(z_score(ub, mean, sd), 3))
    lbz <- paste(round(z_score(lb, mean, sd), 3))
    title <- paste(area, "%\n", lb, " < x < ", ub, sep = "")
    caption <- paste(lbz, "< z <", ubz)
  }
  else { # ub < lb
    # Note that this gets confusing because the "upper" boundary
    # is actually less than, and below, the "lower" boundary.
    ub_fill <- ub
    lb_fill <- lb
    ubn <- pnorm(ub, mean, sd)
    lbn <- pnorm(lb, mean, sd, lower.tail = FALSE)
    area <- round((ubn + lbn)*100, 2)
    ubz <- paste(round(z_score(ub, mean, sd), 3))
    lbz <- paste(round(z_score(lb, mean, sd), 3))
    title <- paste(area, "%\nx < ", ub, ", x > ", lb, sep = "")
    caption <- paste("z <", ubz, ", z >", lbz)
  }
  sub_title <- paste("μ:", mean, ", σ:", sd)
  plot(x_axis, dnorm(x_axis, mean, sd), type = "n", ylab = "Probability", xlab = "x", main = title, sub = sub_title)
  mtext(caption, side = 3)
  if (lb_fill <= ub_fill) { # normal case: shade the inner region
    fill <- seq(lb_fill, ub_fill, length = 100)
    y <- dnorm(fill, mean, sd)
    polygon(c(lb_fill, fill, ub_fill), c(0, y, 0), col = acolor)
  }
  else { # we want the tails: shade the outside
    # Again, note that this gets confusing because the "upper" boundary
    # is actually less than, and below, the "lower" boundary.
    fill1 <- seq(min(x_axis), ub_fill, length = 50)
    y1 <- dnorm(fill1, mean, sd)
    polygon(c(min(x_axis), fill1, ub_fill), c(0, y1, 0), col = acolor)
    fill2 <- seq(lb_fill, max(x_axis), length = 50)
    y2 <- dnorm(fill2, mean, sd)
    polygon(c(lb_fill, fill2, max(x_axis)), c(0, y2, 0), col = acolor)
  }
  lines(x_axis, dnorm(x_axis, mean, sd), type = "l", ...)
}
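As a quick sanity check of normal_area() (the cutoffs here are just an illustrative choice), shading the standard normal between z = -1.96 and z = 1.96 should report roughly 95% of the area:
# About 95% of a standard normal distribution lies between -1.96 and 1.96.
normal_area(mean = 0, sd = 1, lb = -1.96, ub = 1.96)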
## Ex: Class scores are normally distributed with μ = 100 and σ = 16. What percentage of scores are at or above 125?
# Let us calculate the z-score.
zs <- z_score(125, mean = 100, sd = 16)
zs
## [1] 1.5625
# Now calculate the percentage above that z-score.
pnorm(zs, lower.tail = FALSE)
## [1] 0.05908512
# Or, we can do it all at once:
pnorm(125, mean = 100, sd = 16, lower.tail = FALSE)
## [1] 0.05908512
# Graph it.
normal_area(mean=100, sd=16, lb=125)
# Answer = 5.91%
## Ex: Same as before, but for x ≥ 90.
# Let us calculate the z-score so we can show it in our work.
z_score(90, mean = 100, sd = 16)
## [1] -0.625
# Or, we can do it all at once:
pnorm(90, mean = 100, sd = 16, lower.tail = FALSE)
## [1] 0.7340145
# Graph it.
normal_area(mean=100, sd=16, lb=90)
# Answer = 73.4%
## Ex: What is the one-sided p-value for a z statistic of 1.52? What is the two-sided p-value?
# We want to know the percentage greater than a z-score of 1.52.
pnorm(1.52, lower.tail = FALSE)
## [1] 0.06425549
# Graph it.
normal_area(mean=0, sd=1, lb=1.52)
# Answer = 0.0643
# We could just multiply our previous answer by 2, but let's double-check and
# actually do the math.
pnorm(-1.52, lower.tail = TRUE) + pnorm(1.52, lower.tail = FALSE)
## [1] 0.128511
# Graph it. Passing ub < lb tells our graph function to shade the two outside
# tails, so the plot shows both tails and their combined area.
normal_area(mean=0, sd=1, ub=-1.52, lb=1.52)
# Answer = 0.129
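As noted in the comment above, we could also just double one tail, using the symmetry of the normal distribution:
# Double the upper-tail probability; this matches the sum of the two tails above.
2 * pnorm(1.52, lower.tail = FALSE)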
## Ex: Suppose that of all voters in Florida, 40% (p = 0.4) are in favor of candidate Brown for Governor. Pollsters take a sample of 2400 voters. What proportion of the sample would be expected to favor candidate Brown? (Show your work.)
# The expected number of voters favoring Brown is the probability times the sample size.
2400*0.4
## [1] 960
# Alternatively, qbinom gives the 5th percentile of the binomial count: with roughly 95% probability, at least this many of the 2400 sampled voters would favor Brown. That is not what the question asks, though.
qbinom(0.05, size = 2400, prob = 0.4)
## [1] 921
# Answer = 960
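A rough simulation (purely illustrative; the seed and number of replicates are arbitrary choices) also lands near the expected count:
# Simulate many polls of 2400 voters with p = 0.4; the average number of
# Brown supporters should be close to the expected value of 960.
set.seed(1)
mean(rbinom(10000, size = 2400, prob = 0.4))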
## Ex: Suppose the mean age of the population in a particular country is 53 years (μ = 53 years) with σ = 5.5. An SRS of 100 people revealed a sample mean \(\bar{x}\) of 54.85 years. Use a two-sided test to determine if the sample mean is significantly higher than expected. α = 0.05.
Show all hypothesis testing steps.
- Hypothesis
- Test Statistic
- P-value
- Conclusion
\[z\ statistic = \frac{\bar{x} - \mu_0}{SE_x}\] \(\bar{x}\) is the sample mean. \(\mu_0\) is the hypothesized population mean.
Standard error is the standard deviation of the sampling distribution of the mean. It is calculated as: \[SE_x = \frac{\sigma}{\sqrt{n}}\] \(SE_x\) is the standard error of the sample mean. \(\sigma\) is the standard deviation of the underlying population. \(n\) is the sample size.
So we can combine those equations:
\[z\ statistic = \frac{(\bar{x} - \mu_0)\sqrt{n}}{\sigma}\] Note that this is not very different from the classic z-score:
\[z = \frac{x - \mu}{\sigma}\] The z statistic is just the z-score with the sample size taken into account (the \(\sqrt{n}\) in the numerator).
We were given the following data:
pop_mean <- 53
pop_sd <- 5.5
sample_size <- 100
sample_mean <- 54.85
alpha <- 0.05
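As a quick numeric check of the standard-error formula above, here is the two-step route with the values just defined (the helper name se_x is introduced only for this check):
# Standard error of the sample mean, then the z statistic in two steps.
se_x <- pop_sd / sqrt(sample_size) # 0.55
(sample_mean - pop_mean) / se_x    # about 3.36, matching the combined formula used below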
We need the critical value that goes with α = 0.05. Because we are doing a two-tailed test, we allot 1/2 of the alpha to each tail, so the tail probability for each cutoff is α/2 = 0.025.
tail_prob <- alpha/2
tail_prob
## [1] 0.025
First, let’s figure out our acceptance and rejection regions so we do not get confused later:
# To get the critical z value that cuts off this tail probability, use qnorm:
z_cutoff <- round(qnorm(tail_prob), 4) # -1.96
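For reference, both critical values (the boundaries of the rejection region) can be computed at once:
# Lower and upper critical z values for the two-tailed test at alpha = 0.05.
c(qnorm(alpha/2), qnorm(1 - alpha/2)) # approximately -1.96 and 1.96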
Now let us graph the rejection region. Remember this is a two-tailed test:
# Let's shade the rejection region (where we reject the null) in red.
normal_area(mean=0, sd=1, ub=z_cutoff, lb=-z_cutoff, acolor = "red")
Looking at the above graph, it is clear that our z statistic must land in one of the shaded tails (beyond ±1.96) for us to reject the null hypothesis.
# Calculate the z statistic: how far the sample mean is from the hypothesized mean, in standard errors.
zstat <- ((sample_mean-pop_mean)*sqrt(sample_size))/pop_sd
zstat
## [1] 3.363636
# Note that we get the same results by multiplying the z-score by the square root of the
# sample size:
z_score(sample_mean, pop_mean, pop_sd) * sqrt(sample_size)
## [1] 3.363636
Our z statistic is 3.364. We are going to use this value in the same way we use a z-score. It tells us that our sample mean is 3.364 standard errors above the hypothesized population mean.
# Now we can plug the z statistic into pnorm, which tells us the probability of
# observing a value at or above this z under the null hypothesis.
one_sided_p <- pnorm(zstat, lower.tail = FALSE) # 0.0003846
one_sided_p
## [1] 0.0003846141
# Finally, we must multiply by two, because we are performing a two-tailed test.
two_sided_p <- one_sided_p * 2
two_sided_p
## [1] 0.0007692282
# Let's go ahead and graph this. We rescale the z statistic back onto the x axis:
# zstat_u equals zstat * pop_sd, so the boundaries pop_mean ± zstat_u sit 3.364
# standard deviations from the mean and the shaded tails match the two-sided p-value.
zstat_u <- (sample_mean-pop_mean)*sqrt(sample_size) # z statistic times pop_sd
normal_area(mean=pop_mean, sd=pop_sd, ub=pop_mean-zstat_u, lb=pop_mean+zstat_u)
# Answer = 0.000769
So, the probability of observing a sample mean at least this far from the population mean, if the null hypothesis were true, is about 0.00077. This value is in the rejection region of our probability graph.
Our final conclusion is that we reject the null hypothesis, which states that there is no difference between the two means. Since the two-sided p-value (0.00077) is well below α = 0.05, we believe the alternative hypothesis is more likely: there is a significant difference between the population mean and the sample mean.
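As a final sanity check, the decision rule can be written as a direct comparison of the p-value to α:
# Reject the null hypothesis when the two-sided p-value falls below alpha.
two_sided_p < alpha # TRUE, so we reject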
# Example from the lecture slides
# pop_mean <- 170
# pop_sd <- 40
# sample_size <- 64
# sample_mean <- 173