2024-02-07
In the Hair and Siblings residuals table, how many associations are we examining?
Are 95% of them within two standard deviations of the mean?
How about for the other comparisons you carried out in Exercise 2.5?
Exactly what % of a Normal distribution is within two standard deviations of the mean?
## data_2 ## data_1 Black Blonde Brown Grey Red ## 0 1.54 -0.49 -1.66 1.14 0.19 ## 1 -1.08 0.44 1.01 -0.80 0.32 ## 2 -0.32 -0.32 0.71 -0.37 -0.53 ## 3 -0.15 -0.60 0.58 -0.18 -0.25 ## 4 -1.61 2.33 0.58 -0.12 -0.16 ## 5 0.60 -0.22 -0.50 -0.07 -0.09 ## 6 0.60 -0.22 -0.50 -0.07 -0.09
## There are 35 relationships.
I make a function:
miu <- mean(SHtable)
sdev <- sd(SHtable)
lower_bound <- miu - 2 * sdev
upper_bound <- miu + 2 * sdev
within_2sd <- SHtable [SHtable >= lower_bound & SHtable <= upper_bound]
number_within <- length(within_2sd)
perc_within <- number_within / length(SHtable)
percentage_check <- function (a , b) {
d = a - b
if (d > 0)
{
cat ("95 % of them within two standard deviations of the mean.") }
else {
cat ("**LESS THAN** 95 % of them within two standard deviations of the mean."
)
}
}
percentage_check (a = perc_within,
b = 0.95)
## **LESS THAN** 95 % of them within two standard deviations of the mean.
pnorm (q = 2, mean = 0, sd = 1) - pnorm (-2, 0, 1)
## [1] 0.9544997
# or 2 * pnorm (q = 2, mean = 0, sd = 1) - 1
## [1] 0.9544997
Write a function that takes two variables for the same set of objects as arguments and prints out the:
test statistic,
number of degrees of freedom and
p-value (to three decimal places where appropriate).
Test it against Siblings and Hair.
Do you expect there to be any difference if you change the order of the variables? Run the function against Hair and Fish. Write down your conclusion.
hyp_test <- function (a, b) {
obs_table <- table (a, b) # table of observed values
niplus <- rowSums (obs_table)
nplusj <- colSums (obs_table)
n <- sum(obs_table)
exp_table <- niplus %*% t(nplusj) /n # table of expected values
test_stat <- sum(
(obs_table - exp_table)^2
/
exp_table
) # this outputs the test statistic
d <- ( nrow (obs_table) - 1) * (ncol (obs_table) - 1)
p_value <- 1 - pchisq (q = test_stat, df = d)
p_value <- round (p_value, 3) # round to 3 d.p.
return (list ("test statistics" = test_stat,
"degree of freedom" = d,
"p-value" = p_value))
}
hyp_test (a = qanda$Siblings, b = qanda$Hair)
## $`test statistics` ## [1] 12.03986 ## ## $`degree of freedom` ## [1] 24 ## ## $`p-value` ## [1] 0.979
hyp_test (a = qanda$Hair, b = qanda$Siblings)
## $`test statistics` ## [1] 12.03986 ## ## $`degree of freedom` ## [1] 24 ## ## $`p-value` ## [1] 0.979
hyp_test (a = qanda$Handspan, b = qanda$Fish)
## $`test statistics` ## [1] 88.10786 ## ## $`degree of freedom` ## [1] 103 ## ## $`p-value` ## [1] 0.852
Find out the name of the function in R for Fisher’s exact test.
Make sure that you understand what the test does and then run it against our Siblings and Hair data.
Compare the results from this test with those from the \(\chi^2\) test we performed above.
Do the same for Hair and Fish as in Exercise 5.2.
# fisher.test()
SH_fisher_test <- fisher.test (x = qanda$Siblings,
y = qanda$Hair,
workspace = 500000,
conf.level = 0.95,
simulate.p.value = TRUE)
SH_fisher_test
## ## Fisher's Exact Test for Count Data with simulated p-value (based on ## 2000 replicates) ## ## data: qanda$Siblings and qanda$Hair ## p-value = 0.6647 ## alternative hypothesis: two.sided
HF_fisher_test <- fisher.test (x = qanda$Hair,
y = qanda$Fish)
HF_fisher_test
## ## Fisher's Exact Test for Count Data ## ## data: qanda$Hair and qanda$Fish ## p-value = 0.9108 ## alternative hypothesis: two.sided
Run the code given above and check that the new short vectors of Siblings and Hair are correct summaries of the original vectors.
Run the function you created in Exercise 5.2 against these new vectors.
What can you conclude about the relationship between the two variables now?
Compare the result with that from the original test and from Fisher’s exact test.
Hair_short <- rep ("Other", times = nrow (qanda)) # I first replace all hair color to Other
Hair_short [qanda$Hair == "Black"] <- "Black" # Then I retain black color only.
Hair_short <- as.factor (Hair_short)
Siblings_short <- rep (">1", times = nrow (qanda)) # I first replace all numbers with > 1
Siblings_short [qanda$Siblings == 0] <- "0" # I retain those with 0 Siblings
Siblings_short [qanda$Siblings == 1] <- "1" # I also retain those with 1 sibling
Siblings_short <- as.factor (Siblings_short)
## Hair_short ## Siblings_short Black Other ## >1 28 12 ## 0 79 21 ## 1 63 27
## ## Black Blonde Brown Grey Red ## 0 79 4 15 1 1 ## 1 63 5 21 0 1 ## 2 20 1 7 0 0 ## 3 5 0 2 0 0 ## 4 1 1 1 0 0 ## 5 1 0 0 0 0 ## 6 1 0 0 0 0
hyp_test (a = Hair_short, b = Siblings_short)
## $`test statistics` ## [1] 2.374412 ## ## $`degree of freedom` ## [1] 2 ## ## $`p-value` ## [1] 0.305
The original table
hyp_test (a = qanda$Hair, b = qanda$Siblings)
## $`test statistics` ## [1] 12.03986 ## ## $`degree of freedom` ## [1] 24 ## ## $`p-value` ## [1] 0.979
fisher.test (x = Hair_short, y = Siblings_short)
## ## Fisher's Exact Test for Count Data ## ## data: Hair_short and Siblings_short ## p-value = 0.2876 ## alternative hypothesis: two.sided
Work through the general process for hypothesis testing given in Section 5.1 using the data in Table 5.1.
Create a function which takes such a table (of any number of columns) as input and produces the same output as Exercise 5.2.
Work out how to use chisq.test() to perform a goodness of fit test and check the answer from that with the output from your function.
Have we proved that parents stop having children after the first girl? Why?
The following is a six-step general process for hypothesis testing and p-values:
Define your hypotheses: null, \(H_0\), and alternative, \(H_1\)
Specify the statistical test under \(H_0\)
State and check any assumptions you are making
Calculate the test statistic to be used in the test
Work out the p-value corresponding to the test statistic
Draw conclusions in terms of the wording of \(H_0\)
\(H_0\): The Geometric (0.5) model is a good fit.
\(H_1\): The Geometric (0.5) model is not a good fit.
Statistical test:
The probability of a female birth is 0.5;
We will ignore multiple births.
We can first look at the table
## Observed Expected ## 0 100 115.0 ## 1 90 57.5 ## 2 28 28.8 ## 3 7 14.4 ## >=4 5 14.4
hyp_test <- function (a) {
O <- a $ Observed
E <- a $ Expected
C <- ((O - E)^2) / E
chi2 <- sum (C)
D <- 1 * (length(Observed) - 1)
p_value <- 1 - pchisq (q = chi2, df = D)
return (list ("test statistics" = chi2,
"degree of freedom" = D,
"p-value" = p_value))
}
hyp_test (a = T)
## $`test statistics` ## [1] 30.2872 ## ## $`degree of freedom` ## [1] 4 ## ## $`p-value` ## [1] 4.277784e-06
How we interpret them? If we have 5 % significance level.
qchisq(p = 0.95, df = 4)
## [1] 9.487729
Or, we can use the p-value: 4.277784e-06
(The p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data, under the assumption that the null hypothesis is true.)
-> Conclusion: the geometric (0.5) distribution is not a good model.
chisq.test(T$Observed, p = T$Expected / sum(T$Expected))
## ## Chi-squared test for given probabilities ## ## data: T$Observed ## X-squared = 30.3, df = 4, p-value = 4.252e-06
Download the clinical_trials.csv data set. In this synthetic data set, a company Badatta claims that they have developed a new drug Kilvir that can kill the common cold virus and have conducted clinical trials to prove this claim.
Here, each row in the data set represents data for one patient, including their
age group,
whether they have been treated with the Kilvir drug or a placebo (which has no therapeutic effect), and
whether the patient feels the drug has effectively improved their conditions.
Extract a subset of the data containing only patients in the 31-35 age group.
Construct a contingency table of treatment drug versus treatment effect for this age group.
Is there evidence that Kilvir is effective in this age group?
Now have a look at the other age groups.
ct <- read.csv ("clinical_trials.csv")
head (ct)
## PatientID AgeGroup Drug Result ## 1 P1 56-60 Kilvir Effective ## 2 P2 26-30 Placebo Effective ## 3 P3 61-65 Placebo Ineffective ## 4 P4 61-65 Kilvir Effective ## 5 P5 16-20 Placebo Effective ## 6 P6 81-85 Placebo Effective
Here’s the first five age groups shown:
table (ct$AgeGroup) [1:5]
## ## 16-20 21-25 26-30 31-35 36-40 ## 68 46 84 34 52
ctrowD <- which(ct$AgeGroup == "31-35")
dfD <- ct [ ctrowD, ] contD <- table (dfD$Drug, dfD$Result) contD
## ## Effective Ineffective ## Kilvir 9 8 ## Placebo 2 15
chisq.test(contD)$p.value
## [1] 0.02784006
fisher.test(contD)$p.value
## [1] 0.02550983
If we have 5% of significance level, then there is sufficient evidence to reject \(H_0\) that there the new drug Kilvin is effective
But if we only have significance level of 1% (in the case of testing drugs), then there is insufficient evidence to reject \(H_0\), it is not effective.
## odds ratio ## 7.882191
## ## Effective Ineffective ## Kilvir 18 39 ## Placebo 24 33
chisq.test(contAB)
## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: contAB ## X-squared = 0.94246, df = 1, p-value = 0.3316
fisher.test(contAB)
## ## Fisher's Exact Test for Count Data ## ## data: contAB ## p-value = 0.3317 ## alternative hypothesis: true odds ratio is not equal to 1 ## 95 percent confidence interval: ## 0.2737906 1.4623951 ## sample estimates: ## odds ratio ## 0.6371842
## ## Effective Ineffective ## Kilvir 4 9 ## Placebo 4 9
chisq.test(contNO)
## ## Pearson's Chi-squared test ## ## data: contNO ## X-squared = 0, df = 1, p-value = 1
fisher.test(contNO)
## ## Fisher's Exact Test for Count Data ## ## data: contNO ## p-value = 1 ## alternative hypothesis: true odds ratio is not equal to 1 ## 95 percent confidence interval: ## 0.1373006 7.2832873 ## sample estimates: ## odds ratio ## 1
test <- function(age){
ctrowi <- which(ct$AgeGroup == age)
dfi <- ct [ ctrowi, ]
conti <- table (dfi$Drug, dfi$Result)
chi <- chisq.test(conti)
fisher <- fisher.test(conti)
return (list ("p-value (chi-squared)" = chi$p.value,
"p-value (fisher)" = fisher$p.value)
)
}
age.index <- c ("16-20", "26-30", "31-35", "36-40",
"41-45", "46-50", "51-55", "56-60",
"61-65", "66-70", "71-75", "76-80",
"81-85", "86-90") # different ages
p.vec <- sapply(age.index, test)
# apply chi-squared test for each age group
## p-value (chi-squared) p-value (fisher) ## 16-20 0.1240666 0.1231997 ## 26-30 0.6403202 0.64078 ## 31-35 0.02784006 0.02550983 ## 36-40 0.3482841 0.3487349 ## 41-45 0.7125793 0.713982 ## 46-50 1 1 ## 51-55 1 1 ## 56-60 0.7300697 0.7310941 ## 61-65 0.3906743 0.3911247 ## 66-70 1 1 ## 71-75 0.6932818 0.6945843 ## 76-80 0.3990752 0.4003231 ## 81-85 0.6170751 0.6199095 ## 86-90 0.4142162 0.4285714