How to compute the z-score. A z-score measures how many standard deviations a value x lies from the mean: \[z = \frac{x-\mu}{\sigma}\]
## Compute the z-score of 3.4, given a population mean of 4.8 and standard deviation of 1.3.
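Substituting the given values: \[z = \frac{3.4 - 4.8}{1.3} = \frac{-1.4}{1.3} \approx -1.077\] We can wrap the same calculation in a small R function: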
# Compute a z-score: the distance of x from the mean in units of standard deviation.
z_score <- function(x, mean, sd){
  (x - mean)/sd
}
z_score(x=3.4, mean=4.8, sd=1.3)
## [1] -1.076923
You go to the firing range, where the probability of hitting the target with any single shot is 2/3, and five guns are fired simultaneously. We can compute the relevant tail probabilities with pbinom() (the Stat Calculator at http://stattrek.com/online-calculator/binomial.aspx gives the same results):
# P(at most 1 of the 5 shots hits the target)
pbinom(1, size = 5, prob = 2/3)
## [1] 0.04526749
# P(more than 3 of the 5 shots hit the target)
pbinom(3, size = 5, prob = 2/3, lower.tail = FALSE)
## [1] 0.4609053
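As a cross-check, the same lower-tail value can be assembled from the individual outcome probabilities with dbinom():
sum(dbinom(0:1, size = 5, prob = 2/3))
## [1] 0.04526749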
A researcher wants to assess whether children in kindergarten prefer to play with boys or girls. She randomly creates groups of two girls and one boy from a sample of 120 girls and 60 boys. One of the two girls is randomly designated the focal subject, while the other girl and the boy are designated as companions. Each triplet is placed in a classroom for 10 minutes and observed. The observer assesses whether the focal subject spends more time closer to the male companion or to the female companion. The gender preference for each focal subject is recorded as 1 = prefers girl, 2 = prefers boy. Using the gender preference dataset, answer the following questions.
gender.sav
| sexpref | number |
|---|---|
| 1 | 34 |
| 2 | 26 |
H0: The focal subject has no preference. Ha: The focal subject has a preference to play with either boys or girls.
The null proportion is 0.50, because our H0 is that the focal subject has no preference. A subject with no preference between two options should choose each equally often, i.e., 50% of the time.
sexpref <- c(34, 26)
binom.test(sexpref, p = 0.5, alternative = "two.sided")
##
## Exact binomial test
##
## data: sexpref
## number of successes = 34, number of trials = 60, p-value = 0.3663
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4324097 0.6941186
## sample estimates:
## probability of success
## 0.5666667
No, assuming we are using the commonly accepted cutoff of p < 0.05 to designate significance. Our p-value was 0.366, meaning that if the null hypothesis were true and we repeated this experiment many times, we would expect a split at least this uneven about 36.6% of the time.
We fail to reject the null hypothesis.
# P(X <= 34): the cumulative probability of observing 34 or fewer
# "prefer girl" outcomes out of 60 under the null proportion of 0.5.
pbinom(34, 60, 0.5)
## [1] 0.8774696
The table below shows the number of individuals with optimal, normal BMI and those who are overweight by gender.
| | Optimal | Normal | Overweight | Total |
|---|---|---|---|---|
| Male | 22 | 73 | 55 | 150 |
| Female | 43 | 132 | 65 | 240 |
| Total | 65 | 205 | 120 | 390 |
P(optimal BMI): \(65 / 390 = 0.167 = 16.7\%\)
P(optimal BMI, given male): \(22 / 150 = 0.147 = 14.7\%\)
P(male, given overweight): \(55 / 120 = 0.458 = 45.8\%\)
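A quick cross-check in R, reading the same marginal totals off the table:
65/390   # P(optimal BMI)
## [1] 0.1666667
22/150   # P(optimal BMI | male)
## [1] 0.1466667
55/120   # P(male | overweight)
## [1] 0.4583333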
To solve this, we will use the Chi-square test of independence. (Reference: http://www.sthda.com/english/wiki/chi-square-test-of-independence-in-r)
For each cell, we must calculate the expected value: \[e = \frac{\text{row sum} \times \text{column sum}}{\text{grand total}}\]
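As a quick sketch, the full matrix of expected counts can be built from the marginal totals; the helper matrix obs here just restates the observed counts from the table above. (After running chisq.test() below, the same table is also available as chi_res$expected.)
obs <- matrix(c(22, 73, 55, 43, 132, 65), nrow = 2, byrow = TRUE)
round(outer(rowSums(obs), colSums(obs)) / sum(obs), 2)
##       [,1]   [,2]  [,3]
## [1,] 25.00  78.85 46.15
## [2,] 40.00 126.15 73.85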
We are attempting to determine whether the null hypothesis holds: BMI category and gender are independent.
The Chi-square statistic is as follows: \[\chi^2 = \sum{\frac{(o - e)^2}{e}}\] where:

- o is the observed value
- e is the expected value
This calculated Chi-square statistic is compared to the critical value with \(df=(r-1)(c-1)\) degrees of freedom at p = 0.05.
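Rather than looking it up in a statistical table, the critical value can be computed with qchisq(); for our 2 × 3 table, \(df = (2-1)(3-1) = 2\):
qchisq(0.95, df = 2)
## [1] 5.991465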
wt_sex <- data.frame(
Optimal = c(22, 43),
Normal = c(73,132),
Overweight = c(55, 65),
row.names = c('Male','Female')
)
wt_sex
## Optimal Normal Overweight
## Male 22 73 55
## Female 43 132 65
chi_res <- chisq.test(wt_sex)
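Printing the test object shows the overall statistic and p-value (the values below follow directly from the observed table):
chi_res
##
## Pearson's Chi-squared test
##
## data: wt_sex
## X-squared = 4.0446, df = 2, p-value = 0.1324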
To identify which cells contribute most to the total Chi-square score, calculate the residual for each cell:
\[r = \frac{o - e}{\sqrt{e}}\]
These are known as Pearson residuals, and can be retrieved with the following command:
round(chi_res$residuals, 3)
## Optimal Normal Overweight
## Male -0.600 -0.658 1.302
## Female 0.474 0.520 -1.029
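As a check on the first entry: for the Male/Optimal cell, \(e = 150 \times 65 / 390 = 25\), so \(r = (22 - 25)/\sqrt{25} = -0.6\), matching the output above.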
Let’s plot that:
library(corrplot)
## corrplot 0.92 loaded
corrplot(chi_res$residuals, is.cor=FALSE)
Positive residuals are shown in blue; a positive value indicates a positive association (attraction) between the corresponding row and column variables.
Negative residuals are shown in red; a negative value indicates a negative association (repulsion) between the corresponding row and column variables.
The contribution in percent of a given cell to the total Chi-square score is calculated as follows: \[contrib = \frac{r^2}{\chi^2}\]
contrib <- 100*chi_res$residuals^2/chi_res$statistic
round(contrib,3)
## Optimal Normal Overweight
## Male 8.901 10.717 41.92
## Female 5.563 6.698 26.20
# Visualize the contribution
corrplot(contrib, is.cor = FALSE)
# So, our final answer is that we **fail to reject the null hypothesis**: with χ² ≈ 4.04 (df = 2, p ≈ 0.13 > 0.05), the data are consistent with BMI category being independent of gender.
Suppose that test scores are normally distributed in a certain population with a mean of 55 and a standard deviation of 10.
What proportion of the population scores below 65? We can answer this question in two ways: compute the z-score and then find the proportion of a normal distribution below that z-score, or do the whole thing in one command.
# Use our z-score function we defined previously.
zscore <- z_score(x=65, mean=55, sd=10)
zscore
## [1] 1
# The pnorm() function takes a z-score and returns the _lower tail_ under a normal curve.
pnorm(zscore)
## [1] 0.8413447
# Alternatively, we can give pnorm() all the information and do it in one fell swoop.
pnorm(65, mean=55, sd=10)
## [1] 0.8413447
# Answer = 84%
We can visualize the distribution with this function:
# Plot a normal curve with the given mean and sd, shading the area between
# lb and ub; either bound may be omitted to shade an unbounded tail.
normal_area <- function(mean = 0, sd = 1, lb, ub, acolor = "lightblue", ...) {
  # Span the x-axis to roughly +/- 3.3 standard deviations around the mean.
  x_axis <- seq(mean - 3.3 * sd, mean + 3.3 * sd, length = 100)
  if (missing(lb) & missing(ub)) {
    # No bounds: shade the entire curve.
    lb_fill <- min(x_axis)
    ub_fill <- max(x_axis)
    title <- paste("100%")
    caption <- ""
  }
  else if (missing(lb)) {
    # Lower tail only: shade everything below ub.
    lb_fill <- min(x_axis)
    ub_fill <- ub
    area <- round(pnorm(ub, mean, sd) * 100, 1)
    ubz <- paste(round(z_score(ub, mean, sd), 3))
    title <- paste(area, "%\n", "x < ", ub, sep = "")
    caption <- paste("z <", ubz)
  }
  else if (missing(ub)) {
    # Upper tail only: shade everything above lb.
    ub_fill <- max(x_axis)
    lb_fill <- lb
    area <- round(pnorm(lb, mean, sd, lower.tail = FALSE) * 100, 1)
    lbz <- paste(round(z_score(lb, mean, sd), 3))
    title <- paste(area, "%\n", "x > ", lb, sep = "")
    caption <- paste("z >", lbz)
  }
  else {
    # Both bounds: shade the region between lb and ub.
    ub_fill <- ub
    lb_fill <- lb
    ubn <- pnorm(ub, mean, sd)
    lbn <- pnorm(lb, mean, sd)
    area <- round((ubn - lbn) * 100, 1)
    ubz <- paste(round(z_score(ub, mean, sd), 3))
    lbz <- paste(round(z_score(lb, mean, sd), 3))
    title <- paste(area, "%\n", lb, " < x < ", ub, sep = "")
    caption <- paste(lbz, "< z <", ubz)
  }
  sub_title <- paste("μ:", mean, ", σ:", sd)
  fill <- seq(lb_fill, ub_fill, length = 100)
  plot(x_axis, dnorm(x_axis, mean, sd), type = "n",
       ylab = "Density", xlab = "x", main = title, sub = sub_title)
  mtext(caption, side = 3)
  y <- dnorm(fill, mean, sd)
  polygon(c(lb_fill, fill, ub_fill), c(0, y, 0), col = acolor)
  lines(x_axis, dnorm(x_axis, mean, sd), type = "l", ...)
}
Now call the function to display the graph.
normal_area(mean = 55, sd = 10, ub = 65)
# Proportion of scores between 35 and 70:
pnorm(70, mean=55, sd=10) - pnorm(35, mean=55, sd=10)
## [1] 0.9104427
normal_area(mean = 55, sd = 10, lb = 35, ub = 70)
# Answer = 91.0%
# Assuming a class of 1,000 students: first calculate the proportion scoring between 37 and 70.
percent <- pnorm(70, mean=55, sd=10) - pnorm(37, mean=55, sd=10)
# Now multiply the proportion by 1000 and round to the nearest whole number.
round(1000 * percent)
## [1] 897
# Let's visualize the distribution
normal_area(mean = 55, sd = 10, lb = 37, ub = 70)
# Answer = 897 students
# What percentile is a score of 75?
pnorm(75, mean=55, sd=10)
## [1] 0.9772499
normal_area(mean = 55, sd = 10, ub = 75)
# Answer = 97.7th percentile
Among coffee drinkers, men drink a mean of 3.2 cups per day with a standard deviation of 0.8 cups. Assume the number of cups consumed per day follows a normal distribution. Show your work and, if possible, include the curve diagram.
# Probability of drinking more than 2 cups per day:
pnorm(2, mean = 3.2, sd = 0.8, lower.tail = FALSE)
## [1] 0.9331928
normal_area(mean = 3.2, sd = 0.8, lb = 2)
# Answer = 93.3%
# The answer here depends on how you interpret "no more than 4 cups per day." We will treat it as fewer than 5 cups.
pnorm(5, mean=3.2, sd=0.8)
## [1] 0.9877755
normal_area(mean = 3.2, sd = 0.8, ub = 5)
# Answer = 98.8%
# The qnorm() command returns the value of x that sits at a given percentile of a normal distribution with the given mean and standard deviation.
qnorm(0.95, mean=3.2, sd=0.8) # returns 4.516
## [1] 4.515883
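Equivalently, the same value can be recovered from the standard-normal quantile via \(x = \mu + z\sigma\):
3.2 + qnorm(0.95) * 0.8
## [1] 4.515883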
# Let's look at the graph to confirm this.
normal_area(mean = 3.2, sd = 0.8, lb = 4.516)
# Answer = 4.516, or 5 cups (if rounding)
A study showed that in a survey of more than 15,000 men who had a heart attack, recovered, and then subsequently died, 60% died of a second (or later) heart attack, while 40% died of unrelated causes. The case histories of 20 men who have had a heart attack and recovered are under study. Determine the probability that at least 10 of these men will die of a later heart attack.
# To solve this, let's calculate the cumulative probability using the binomial equation.
prob_success <- 0.6 # In this case, "success" means death, our outcome of interest.
trials <- 20
num_success <- 10 # Again, "success" = death from secondary MI in this case.
prob_fail <- 1.0 - prob_success
v_success <- seq(from = 0, to = trials, by = 1)
v_fail <- rev(v_success)
# Binomial probability of each possible count of deaths, k = 0..20
v_chs <- choose(trials, v_success)
prob_x <- v_chs * prob_success^v_success * prob_fail^v_fail
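As a quick sanity check, these 21 probabilities should sum to one:
sum(prob_x)
## [1] 1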
# Color code the columns: purple marks the "at least 10" outcomes (k = 10..20).
# Note there are trials + 1 bars, one per outcome k = 0..20.
cols <- rep("seagreen", trials + 1)
cols[(num_success + 1):(trials + 1)] <- "purple"
# Plot our results
barplot(prob_x,
names.arg=v_success,
xlab="Another MI",
ylab="Probability",col=cols,
main="Cause of Death in 20 MI Survivors",border="darkorange",
lwd=3 #darken y-axis
)
# The variable prob_x now holds the probability that a second MI was the cause of death for exactly 0..20 of the 20 men. To find the probability of "10 or more," we sum the last 11 entries (k = 10 through 20).
sum(tail(prob_x, 11))
## [1] 0.8724788
# Answer = 87.2%
# Alternatively, we could just run this one line to get the answer.
pbinom(9, size = 20, prob = 0.6, lower.tail = FALSE)
## [1] 0.8724788
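As a final cross-check, summing the individual probabilities for k = 10 through 20 with dbinom() gives the same answer:
sum(dbinom(10:20, size = 20, prob = 0.6))
## [1] 0.8724788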