How to compute the z-score. A z-score measures how many standard deviations a value x lies from the mean: \[z = \frac{x-\mu}{\sigma}\]
## Compute the z-score of 3.4, given a population mean of 4.8 and standard deviation of 1.3.
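Substituting the given values: \[z = \frac{3.4 - 4.8}{1.3} = \frac{-1.4}{1.3} \approx -1.077\] We can wrap the same calculation in a small R function: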
# Compute a z-score: the distance of x from the mean in units of standard deviation.
z_score <- function(x, mean, sd){
  (x - mean)/sd
}
z_score(x=3.4, mean=4.8, sd=1.3)
## [1] -1.076923
You go to the firing range, where the probability of hitting the target with any single shot is 2/3, and five guns are fired simultaneously. We can compute the relevant tail probabilities with pbinom() (the Stat Calculator at http://stattrek.com/online-calculator/binomial.aspx gives the same results):
# P(at most 1 of the 5 shots hits the target)
pbinom(1, size = 5, prob = 2/3)
## [1] 0.04526749
# P(more than 3 of the 5 shots hit the target)
pbinom(3, size = 5, prob = 2/3, lower.tail = FALSE)
## [1] 0.4609053
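As a cross-check, the same lower-tail value can be assembled from the individual outcome probabilities with dbinom():
sum(dbinom(0:1, size = 5, prob = 2/3))
## [1] 0.04526749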
A researcher wants to assess whether children in kindergarten prefer to play with boys or girls. She randomly creates groups of two girls and one boy from a sample of 120 girls and 60 boys. One of the two girls is randomly designated the focal subject, while the other girl and the boy are designated as companions. Each triplet is placed in a classroom for 10 minutes and observed. The observer assesses whether the focal subject spends more time closer to the male companion or to the female companion. The gender preference for each focal subject is recorded as 1 = prefers girl, 2 = prefers boy. Using the gender preference dataset, answer the following questions.
gender.sav
| sexpref | number |
|---|---|
| 1 | 34 |
| 2 | 26 |
H0: The focal subject has no preference. Ha: The focal subject has a preference to play with either boys or girls.
The null proportion is 0.50, because our H0 is that the focal subject has no preference. A subject with no preference between two options should choose each equally often, i.e., 50% of the time.
sexpref <- c(34, 26)
binom.test(sexpref, p = 0.5, alternative = "two.sided")
##
## Exact binomial test
##
## data: sexpref
## number of successes = 34, number of trials = 60, p-value = 0.3663
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4324097 0.6941186
## sample estimates:
## probability of success
## 0.5666667
No, assuming we are using the commonly accepted cutoff of p < 0.05 to designate significance. Our p-value was 0.366, meaning that if the null hypothesis were true and we repeated this experiment many times, we would expect a split at least this uneven about 36.6% of the time.
We fail to reject the null hypothesis.
# P(X <= 34): the cumulative probability of observing 34 or fewer
# "prefer girl" outcomes out of 60 under the null proportion of 0.5.
pbinom(34, 60, 0.5)
## [1] 0.8774696
The table below shows the number of individuals with optimal, normal BMI and those who are overweight by gender.
| | Optimal | Normal | Overweight | Total |
|---|---|---|---|---|
| Male | 22 | 73 | 55 | 150 |
| Female | 43 | 132 | 65 | 240 |
| Total | 65 | 205 | 120 | 390 |
P(optimal BMI): \(65 / 390 = 0.167 = 16.7\%\)
P(optimal BMI, given male): \(22 / 150 = 0.147 = 14.7\%\)
P(male, given overweight): \(55 / 120 = 0.458 = 45.8\%\)
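A quick cross-check in R, reading the same marginal totals off the table:
65/390   # P(optimal BMI)
## [1] 0.1666667
22/150   # P(optimal BMI | male)
## [1] 0.1466667
55/120   # P(male | overweight)
## [1] 0.4583333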
To solve this, we will use the Chi-square test of independence. (Reference: http://www.sthda.com/english/wiki/chi-square-test-of-independence-in-r)
For each cell, we must calculate the expected value: \[e = \frac{\text{row sum} \times \text{column sum}}{\text{grand total}}\]
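As a quick sketch, the full matrix of expected counts can be built from the marginal totals; the helper matrix obs here just restates the observed counts from the table above. (After running chisq.test() below, the same table is also available as chi_res$expected.)
obs <- matrix(c(22, 73, 55, 43, 132, 65), nrow = 2, byrow = TRUE)
round(outer(rowSums(obs), colSums(obs)) / sum(obs), 2)
##       [,1]   [,2]  [,3]
## [1,] 25.00  78.85 46.15
## [2,] 40.00 126.15 73.85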
We are attempting to determine whether the null hypothesis holds: BMI category and gender are independent.
The Chi-square statistic is as follows: \[\chi^2 = \sum{\frac{(o - e)^2}{e}}\] where:

- o is the observed value
- e is the expected value
This calculated Chi-square statistic is compared to the critical value with \(df=(r-1)(c-1)\) degrees of freedom at p = 0.05.
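Rather than looking it up in a statistical table, the critical value can be computed with qchisq(); for our 2 × 3 table, \(df = (2-1)(3-1) = 2\):
qchisq(0.95, df = 2)
## [1] 5.991465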
wt_sex <- data.frame(
Optimal = c(22, 43),
Normal = c(73,132),
Overweight = c(55, 65),
row.names = c('Male','Female')
)
wt_sex
## Optimal Normal Overweight
## Male 22 73 55
## Female 43 132 65
chi_res <- chisq.test(wt_sex)
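Printing the test object shows the overall statistic and p-value (the values below follow directly from the observed table):
chi_res
##
## Pearson's Chi-squared test
##
## data: wt_sex
## X-squared = 4.0446, df = 2, p-value = 0.1324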
To identify which cells contribute most to the total Chi-square score, calculate the residual for each cell:
\[r = \frac{o - e}{\sqrt{e}}\]
These are known as Pearson residuals, and can be retrieved with the following command:
round(chi_res$residuals, 3)
## Optimal Normal Overweight
## Male -0.600 -0.658 1.302
## Female 0.474 0.520 -1.029
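As a check on the first entry: for the Male/Optimal cell, \(e = 150 \times 65 / 390 = 25\), so \(r = (22 - 25)/\sqrt{25} = -0.6\), matching the output above.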
Let’s plot that:
library(corrplot)
## corrplot 0.92 loaded
corrplot(chi_res$residuals, is.cor=FALSE)
Positive residuals are shown in blue; a positive value indicates a positive association (attraction) between the corresponding row and column variables.
Negative residuals are shown in red; a negative value indicates a negative association (repulsion) between the corresponding row and column variables.
The contribution in percent of a given cell to the total Chi-square score is calculated as follows: \[contrib = \frac{r^2}{\chi^2}\]
contrib <- 100*chi_res$residuals^2/chi_res$statistic
round(contrib,3)
## Optimal Normal Overweight
## Male 8.901 10.717 41.92
## Female 5.563 6.698 26.20
# Visualize the contribution
corrplot(contrib, is.cor = FALSE)
# So, our final answer is that we **fail to reject the null hypothesis**: with χ² ≈ 4.04 (df = 2, p ≈ 0.13 > 0.05), the data are consistent with BMI category being independent of gender.
Suppose that test scores are normally distributed in a certain population with a mean of 55 and a standard deviation of 10.
What proportion of the population scores below 65? We can answer this question in two ways: compute the z-score and then find the proportion of a normal distribution below that z-score, or do the whole thing in one command.
# Use our z-score function we defined previously.
zscore <- z_score(x=65, mean=55, sd=10)
zscore
## [1] 1
# The pnorm() function takes a z-score and returns the _lower tail_ under a normal curve.
pnorm(zscore)
## [1] 0.8413447
# Alternatively, we can give pnorm() all the information and do it in one fell swoop.
pnorm(65, mean=55, sd=10)
## [1] 0.8413447
# Answer = 84%
We can visualize the distribution with this function:
# Plot a normal curve with the given mean and sd, shading the area between
# lb and ub; either bound may be omitted to shade an unbounded tail.
normal_area <- function(mean = 0, sd = 1, lb, ub, acolor = "lightblue", ...) {
  # Span the x-axis to roughly +/- 3.3 standard deviations around the mean.
  x_axis <- seq(mean - 3.3 * sd, mean + 3.3 * sd, length = 100)
  if (missing(lb) & missing(ub)) {
    # No bounds: shade the entire curve.
    lb_fill <- min(x_axis)
    ub_fill <- max(x_axis)
    title <- paste("100%")
    caption <- ""
  }
  else if (missing(lb)) {
    # Lower tail only: shade everything below ub.
    lb_fill <- min(x_axis)
    ub_fill <- ub
    area <- round(pnorm(ub, mean, sd) * 100, 1)
    ubz <- paste(round(z_score(ub, mean, sd), 3))
    title <- paste(area, "%\n", "x < ", ub, sep = "")
    caption <- paste("z <", ubz)
  }
  else if (missing(ub)) {
    # Upper tail only: shade everything above lb.
    ub_fill <- max(x_axis)
    lb_fill <- lb
    area <- round(pnorm(lb, mean, sd, lower.tail = FALSE) * 100, 1)
    lbz <- paste(round(z_score(lb, mean, sd), 3))
    title <- paste(area, "%\n", "x > ", lb, sep = "")
    caption <- paste("z >", lbz)
  }
  else {
    # Both bounds: shade the region between lb and ub.
    ub_fill <- ub
    lb_fill <- lb
    ubn <- pnorm(ub, mean, sd)
    lbn <- pnorm(lb, mean, sd)
    area <- round((ubn - lbn) * 100, 1)
    ubz <- paste(round(z_score(ub, mean, sd), 3))
    lbz <- paste(round(z_score(lb, mean, sd), 3))
    title <- paste(area, "%\n", lb, " < x < ", ub, sep = "")
    caption <- paste(lbz, "< z <", ubz)
  }
  sub_title <- paste("μ:", mean, ", σ:", sd)
  fill <- seq(lb_fill, ub_fill, length = 100)
  plot(x_axis, dnorm(x_axis, mean, sd), type = "n",
       ylab = "Density", xlab = "x", main = title, sub = sub_title)
  mtext(caption, side = 3)
  y <- dnorm(fill, mean, sd)
  polygon(c(lb_fill, fill, ub_fill), c(0, y, 0), col = acolor)
  lines(x_axis, dnorm(x_axis, mean, sd), type = "l", ...)
}
Now call the function to display the graph.
normal_area(mean = 55, sd = 10, ub = 65)
# Proportion of scores between 35 and 70:
pnorm(70, mean=55, sd=10) - pnorm(35, mean=55, sd=10)
## [1] 0.9104427
normal_area(mean = 55, sd = 10, lb = 35, ub = 70)
# Answer = 91.0%
# Assuming a class of 1,000 students: first calculate the proportion scoring between 37 and 70.
percent <- pnorm(70, mean=55, sd=10) - pnorm(37, mean=55, sd=10)
# Now multiply the proportion by 1000 and round to the nearest whole number.
round(1000 * percent)
## [1] 897
# Let's visualize the distribution
normal_area(mean = 55, sd = 10, lb = 37, ub = 70)
# Answer = 897 students
# What percentile is a score of 75?
pnorm(75, mean=55, sd=10)
## [1] 0.9772499
normal_area(mean = 55, sd = 10, ub = 75)
# Answer = 97.7th percentile
Among coffee drinkers, men drink a mean of 3.2 cups per day with a standard deviation of 0.8 cups. Assume the number of cups consumed per day follows a normal distribution. Show your work and, if possible, include the curve diagram.
# Probability of drinking more than 2 cups per day:
pnorm(2, mean = 3.2, sd = 0.8, lower.tail = FALSE)
## [1] 0.9331928
normal_area(mean = 3.2, sd = 0.8, lb = 2)
# Answer = 93.3%
# The answer here depends on how you interpret "no more than 4 cups per day." We will treat it as fewer than 5 cups.
pnorm(5, mean=3.2, sd=0.8)
## [1] 0.9877755
normal_area(mean = 3.2, sd = 0.8, ub = 5)
# Answer = 98.8%
# The qnorm() command returns the value of x that sits at a given percentile of a normal distribution with the given mean and standard deviation.
qnorm(0.95, mean=3.2, sd=0.8) # returns 4.516
## [1] 4.515883
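Equivalently, the same value can be recovered from the standard-normal quantile via \(x = \mu + z\sigma\):
3.2 + qnorm(0.95) * 0.8
## [1] 4.515883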
# Let's look at the graph to confirm this.
normal_area(mean = 3.2, sd = 0.8, lb = 4.516)
# Answer = 4.516, or 5 cups (if rounding)
A study showed that in a survey of more than 15,000 men who had a heart attack, recovered, and then subsequently died, 60% died of a second (or later) heart attack, while 40% died of unrelated causes. The case histories of 20 men who have had a heart attack and recovered are under study. Determine the probability that at least 10 of these men will die of a later heart attack.
# To solve this, let's calculate the cumulative probability using the binomial equation.
prob_success <- 0.6 # In this case, "success" means death, our outcome of interest.
trials <- 20
num_success <- 10 # Again, "success" = death from secondary MI in this case.
prob_fail <- 1.0 - prob_success
v_success <- seq(from = 0, to = trials, by = 1)
v_fail <- rev(v_success)
# Binomial probability of each possible count of deaths, k = 0..20
v_chs <- choose(trials, v_success)
prob_x <- v_chs * prob_success^v_success * prob_fail^v_fail
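As a quick sanity check, these 21 probabilities should sum to one:
sum(prob_x)
## [1] 1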
# Color code the columns: purple marks the "at least 10" outcomes (k = 10..20).
# Note there are trials + 1 bars, one per outcome k = 0..20.
cols <- rep("seagreen", trials + 1)
cols[(num_success + 1):(trials + 1)] <- "purple"
# Plot our results
barplot(prob_x,
names.arg=v_success,
xlab="Another MI",
ylab="Probability",col=cols,
main="Cause of Death in 20 MI Survivors",border="darkorange",
lwd=3 #darken y-axis
)
# The variable prob_x now holds the probability that a second MI was the cause of death for exactly 0..20 of the 20 men. To find the probability of "10 or more," we sum the last 11 entries (k = 10 through 20).
sum(tail(prob_x, 11))
## [1] 0.8724788
# Answer = 87.2%
# Alternatively, we could just run this one line to get the answer.
pbinom(9, size = 20, prob = 0.6, lower.tail = FALSE)
## [1] 0.8724788
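As a final cross-check, summing the individual probabilities for k = 10 through 20 with dbinom() gives the same answer:
sum(dbinom(10:20, size = 20, prob = 0.6))
## [1] 0.8724788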