In the Hair and Siblings residuals table, how many associations are we examining?
Are 95% of them within two standard deviations of the mean?
How about for the other comparisons you carried out in Exercise 2.5?
Exactly what % of a Normal distribution is within two standard deviations of the mean?
## data_2
## data_1 Black Blonde Brown Grey Red
## 0 1.54 -0.49 -1.66 1.14 0.19
## 1 -1.08 0.44 1.01 -0.80 0.32
## 2 -0.32 -0.32 0.71 -0.37 -0.53
## 3 -0.15 -0.60 0.58 -0.18 -0.25
## 4 -1.61 2.33 0.58 -0.12 -0.16
## 5 0.60 -0.22 -0.50 -0.07 -0.09
## 6 0.60 -0.22 -0.50 -0.07 -0.09
## There are 35 relationships.
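(That is 7 sibling categories × 5 hair colours, giving 7 × 5 = 35 residuals to examine.)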
First I compute the mean and standard deviation of the residuals and the proportion lying within two standard deviations; then a small function reports whether that proportion reaches 95%:
miu <- mean(SHtable)   # mean of the residuals
sdev <- sd(SHtable)    # their standard deviation
lower_bound <- miu - 2 * sdev
upper_bound <- miu + 2 * sdev
within_2sd <- SHtable[SHtable >= lower_bound & SHtable <= upper_bound]
number_within <- length(within_2sd)             # how many residuals fall inside
perc_within <- number_within / length(SHtable)  # the proportion that do
percentage_check <- function(a, b) {
  d <- a - b
  if (d > 0) {
    cat("MORE THAN 95% of them are within two standard deviations of the mean.")
  } else {
    cat("**LESS THAN** 95% of them are within two standard deviations of the mean.")
  }
}
Plugging in the values from our data:
percentage_check(a = perc_within, b = 0.95)
## **LESS THAN** 95% of them are within two standard deviations of the mean.
pnorm (q = 2, mean = 0, sd = 1) - pnorm (-2, 0, 1)
## [1] 0.9544997
# or
2 * pnorm (q = 2, mean = 0, sd = 1) - 1
## [1] 0.9544997
Write a function that takes two variables for the same set of objects as arguments and prints out the:
test statistic,
number of degrees of freedom and
p-value (to three decimal places where appropriate).
Test it against Siblings and Hair.
Do you expect there to be any difference if you change the order of the variables? Run the function against Hair and Fish. Write down your conclusion.
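Under independence, the expected count in cell \((i, j)\) is \(e_{ij} = n_{i+} n_{+j} / n\), where \(n_{i+}\) and \(n_{+j}\) are the row and column totals; the outer product in the function below computes the whole table of expected counts in one step.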
hyp_test <- function(a, b) {
  obs_table <- table(a, b)              # table of observed counts
  niplus <- rowSums(obs_table)          # row totals n_{i+}
  nplusj <- colSums(obs_table)          # column totals n_{+j}
  n <- sum(obs_table)                   # grand total
  exp_table <- niplus %*% t(nplusj) / n # expected counts under independence
  test_stat <- sum((obs_table - exp_table)^2 / exp_table)  # chi-squared statistic
  d <- (nrow(obs_table) - 1) * (ncol(obs_table) - 1)       # degrees of freedom
  p_value <- 1 - pchisq(q = test_stat, df = d)
  p_value <- round(p_value, 3)          # round to 3 d.p.
  return(list("test statistic" = test_stat,
              "degrees of freedom" = d,
              "p-value" = p_value))
}
hyp_test(a = qanda$Siblings, b = qanda$Hair)
## $`test statistic`
## [1] 12.03986
##
## $`degrees of freedom`
## [1] 24
##
## $`p-value`
## [1] 0.979
hyp_test(a = qanda$Hair, b = qanda$Siblings)
## $`test statistic`
## [1] 12.03986
##
## $`degrees of freedom`
## [1] 24
##
## $`p-value`
## [1] 0.979
Run the function against Hair and Fish:
hyp_test(a = qanda$Handspan, b = qanda$Fish)
## $`test statistic`
## [1] 88.10786
##
## $`degrees of freedom`
## [1] 103
##
## $`p-value`
## [1] 0.852
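Conclusion: changing the order of the variables makes no difference. Swapping the arguments simply transposes the contingency table, so the test statistic, degrees of freedom and p-value are unchanged, as the two runs on Siblings and Hair confirm. For the second comparison, the p-value of 0.852 is far above any conventional significance level, so there is no evidence of an association there either.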
Find out the name of the function in R for Fisher’s exact test.
Make sure that you understand what the test does and then run it against our Siblings and Hair data.
Compare the results from this test with those from the \(\chi^2\) test we performed above.
Do the same for Hair and Fish as in Exercise 5.2.
# fisher.test() is the R function for Fisher's exact test
SH_fisher_test <- fisher.test(x = qanda$Siblings,
                              y = qanda$Hair,
                              workspace = 500000,
                              conf.level = 0.95,
                              simulate.p.value = TRUE)
SH_fisher_test
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: qanda$Siblings and qanda$Hair
## p-value = 0.6712
## alternative hypothesis: two.sided
HF_fisher_test <- fisher.test (x = qanda$Hair,
y = qanda$Fish)
HF_fisher_test
##
## Fisher's Exact Test for Count Data
##
## data: qanda$Hair and qanda$Fish
## p-value = 0.9108
## alternative hypothesis: two.sided
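The conclusions agree with the \(\chi^2\) tests above: Fisher's exact test gives p = 0.671 for Siblings and Hair (against 0.979 from our function) and p = 0.911 for Hair and Fish, so neither test finds any evidence of association at conventional significance levels.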
Notes:
fisher.test() assesses the association between two categorical variables and is particularly useful when the sample size is small.
Fisher's exact test is generally used for small samples (roughly n < 1000), especially when the expected frequency is below 5 in more than 20% of the cells; the \(\chi^2\) test is an approximation better suited to large samples.
Fisher's exact test is exact: it returns the exact p-value, i.e. the probability, assuming the variables are independent, of obtaining a table at least as extreme as the one observed.
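As a quick way to check the 20% rule of thumb above, one can inspect the expected counts that chisq.test() computes. Here is a minimal sketch (sparse_cells is a hypothetical helper, not part of the exercise):
# Proportion of cells whose expected count is below 5; `a` and `b` are
# two categorical vectors of the same length. suppressWarnings() quiets
# the "approximation may be incorrect" warning for sparse tables.
sparse_cells <- function(a, b) {
  exp_counts <- suppressWarnings(chisq.test(table(a, b)))$expected
  mean(exp_counts < 5)
}
# e.g. sparse_cells(qanda$Siblings, qanda$Hair)
For Siblings and Hair, several column totals are tiny (Grey has 1 respondent, Red has 3), so a large share of the expected counts falls below 5; this is exactly what motivates the collapsed table in the next exercise.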
Run the code given above and check that the new short vectors of Siblings and Hair are correct summaries of the original vectors.
Run the function you created in Exercise 5.2 against these new vectors.
What can you conclude about the relationship between the two variables now?
Compare the result with that from the original test and from Fisher’s exact test.
Hair_short <- rep ("Other", times = nrow (qanda)) # I first replace all hair color to Other
Hair_short [qanda$Hair == "Black"] <- "Black" # Then I retain black color only.
Hair_short <- as.factor (Hair_short)
Siblings_short <- rep (">1", times = nrow (qanda)) # I first replace all numbers with > 1
Siblings_short [qanda$Siblings == 0] <- "0" # I retain those with 0 Siblings
Siblings_short [qanda$Siblings == 1] <- "1" # I also retain those with 1 sibling
Siblings_short <- as.factor (Siblings_short)
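A quick consistency check before looking at the table (a sketch, assuming qanda as before): the collapsed categories should reproduce the original totals.
sum(qanda$Hair == "Black")   # should be 170, the Black column total
sum(qanda$Siblings == 0)     # should be 100, the "0" row total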
Let's take a look at the updated table:
## Hair_short
## Siblings_short Black Other
## >1 28 12
## 0 79 21
## 1 63 27
And the original table for comparison:
##
## Black Blonde Brown Grey Red
## 0 79 4 15 1 1
## 1 63 5 21 0 1
## 2 20 1 7 0 0
## 3 5 0 2 0 0
## 4 1 1 1 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
hyp_test(a = Hair_short, b = Siblings_short)
## $`test statistic`
## [1] 2.374412
##
## $`degrees of freedom`
## [1] 2
##
## $`p-value`
## [1] 0.305
Running the test on the original variables again:
hyp_test(a = qanda$Hair, b = qanda$Siblings)
## $`test statistic`
## [1] 12.03986
##
## $`degrees of freedom`
## [1] 24
##
## $`p-value`
## [1] 0.979
And Fisher's exact test on the collapsed vectors:
fisher.test (x = Hair_short, y = Siblings_short)
##
## Fisher's Exact Test for Count Data
##
## data: Hair_short and Siblings_short
## p-value = 0.2876
## alternative hypothesis: two.sided
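Conclusion: collapsing the sparse categories changes the numbers considerably. The \(\chi^2\) p-value drops from 0.979 to 0.305, and Fisher's exact test gives 0.288, but all remain well above 5%, so there is still insufficient evidence of an association between hair colour and number of siblings. The collapsed 3 × 2 table, however, has no cells with small expected counts, so its \(\chi^2\) approximation is far more trustworthy than that of the original 7 × 5 table, and the \(\chi^2\) and Fisher p-values now agree closely.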
Work through the general process for hypothesis testing given in Section 5.1 using the data in Table 5.1.
Create a function which takes such a table (of any number of columns) as input and produces the same output as Exercise 5.2.
Work out how to use chisq.test() to perform a goodness of fit test and check the answer from that with the output from your function.
Have we proved that parents stop having children after the first girl? Why?
The following is a six-step general process for hypothesis testing and p-values:
Define your hypotheses: null, \(H_0\), and alternative, \(H_1\)
Specify the statistical test under \(H_0\)
State and check any assumptions you are making
Calculate the test statistic to be used in the test
Work out the p-value corresponding to the test statistic
Draw conclusions in terms of the wording of \(H_0\)
\(H_0\): The Geometric (0.5) model is a good fit.
\(H_1\): The Geometric (0.5) model is not a good fit.
Statistical test: a \(\chi^2\) goodness-of-fit test, comparing the observed sibling counts with those expected under a Geometric(0.5) model.
Assumptions:
The probability of a female birth is 0.5;
we will ignore multiple births.
We first build a table of observed and expected counts:
col_names <- c ("Observed", "Expected")
row_names <- c ("0", "1", "2", "3", ">=4")
Observed <- c (100, 90, 28, 7, 5)
Expected0to3 <- c (length(qanda$Siblings)
*
dgeom ((x = 0:3), prob = 0.5))
Expected_4to6 <- sum(length(qanda$Siblings)
*
pgeom (q = 3, prob = 0.5, lower.tail = FALSE))
matrix_data <- matrix (c (Observed, Expected0to3, Expected_4to6),
nrow = 2, byrow = TRUE)
matrix_data <- round (matrix_data , 1)
T <- as.data.frame (t(matrix_data))
rownames (T) <- row_names
colnames (T) <- col_names
T
## Observed Expected
## 0 100 115.0
## 1 90 57.5
## 2 28 28.8
## 3 7 14.4
## >=4 5 14.4
The test statistic is \(\sum_i (O_i - E_i)^2 / E_i\). Adapting our function to the goodness-of-fit setting:
hyp_test <- function(a) {
  O <- a$Observed
  E <- a$Expected
  C <- (O - E)^2 / E      # per-category contributions
  chi2 <- sum(C)          # goodness-of-fit test statistic
  D <- length(O) - 1      # degrees of freedom: number of categories minus one
  p_value <- 1 - pchisq(q = chi2, df = D)
  return(list("test statistic" = chi2,
              "degrees of freedom" = D,
              "p-value" = p_value))
}
hyp_test(a = gof_table)
## $`test statistic`
## [1] 30.2872
##
## $`degrees of freedom`
## [1] 4
##
## $`p-value`
## [1] 4.277784e-06
How do we interpret this? Suppose we work at the 5% significance level.
qchisq(p = 0.95, df = 4)
## [1] 9.487729
The critical value is 9.487729. Since 30.2872 > 9.487729, there is sufficient evidence to reject \(H_0\): a Geometric(0.5) distribution is not a good model.
Alternatively, we can use the p-value: 4.277784e-06.
(The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data, under the assumption that the null hypothesis is true.)
Such a tiny value means the observed data would be very unlikely if the null hypothesis were true, so we have sufficient evidence to reject it at the 5% significance level.
Conclusion: the Geometric(0.5) distribution is not a good model.
chisq.test(gof_table$Observed, p = gof_table$Expected / sum(gof_table$Expected))
##
## Chi-squared test for given probabilities
##
## data: gof_table$Observed
## X-squared = 30.3, df = 4, p-value = 4.252e-06
What does this mean? chisq.test() reproduces our result; the tiny difference (30.3 vs 30.2872) arises because the rounded Expected column no longer sums exactly to 230, so the probabilities passed to chisq.test() are rescaled slightly.
Have we proved that parents stop having children after the first girl? No. We have only found strong evidence that a Geometric(0.5) model fits these data poorly; rejecting one model does not establish any particular mechanism, and a hypothesis test can provide evidence against \(H_0\) but never proves \(H_1\).
Download the clinical_trials.csv data set. In this synthetic data set, a company Badatta claims that they have developed a new drug Kilvir that can kill the common cold virus and have conducted clinical trials to prove this claim.
Here, each row in the data set represents data for one patient, including their
age group,
whether they have been treated with the Kilvir drug or a placebo (which has no therapeutic effect), and
whether the patient feels the drug has effectively improved their condition.
Extract a subset of the data containing only patients in the 31-35 age group.
Construct a contingency table of treatment drug versus treatment effect for this age group.
Is there evidence that Kilvir is effective in this age group?
Now have a look at the other age groups.
ct <- read.csv ("clinical_trials.csv")
head (ct)
## PatientID AgeGroup Drug Result
## 1 P1 56-60 Kilvir Effective
## 2 P2 26-30 Placebo Effective
## 3 P3 61-65 Placebo Ineffective
## 4 P4 61-65 Kilvir Effective
## 5 P5 16-20 Placebo Effective
## 6 P6 81-85 Placebo Effective
\(H_0\): the new drug Kilvir is not effective.
\(H_1\): the new drug Kilvir is effective.
Conventionally, the claim we want to find evidence for goes in the alternative hypothesis.
Here are the first five age groups:
table (ct$AgeGroup) [1:5]
##
## 16-20 21-25 26-30 31-35 36-40
## 68 46 84 34 52
ctrowD <- which(ct$AgeGroup == "31-35")  # rows for the 31-35 age group
dfD <- ct[ctrowD, ]                      # subset of the data
contD <- table(dfD$Drug, dfD$Result)     # contingency table: drug vs result
contD
##
## Effective Ineffective
## Kilvir 9 8
## Placebo 2 15
chisq.test(contD)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contD
## X-squared = 4.8379, df = 1, p-value = 0.02784
fisher.test(contD)
##
## Fisher's Exact Test for Count Data
##
## data: contD
## p-value = 0.02551
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.217552 92.482179
## sample estimates:
## odds ratio
## 7.882191
At the 5% significance level there is sufficient evidence to reject \(H_0\) (that there is no relationship between the drug received and the outcome), since both p-values are below 0.05.
At the 1% significance level, however, there is insufficient evidence to reject \(H_0\).
An odds ratio greater than 1 indicates a positive association, and an odds ratio of 7.88 indicates a strong one.
Treating the drug as the explanatory variable and the outcome (effective/ineffective) as the response, the estimate says that the odds of a patient reporting the treatment as effective are about 8 times higher with Kilvir than with the placebo in this age group.
(Note that fisher.test() reports the conditional maximum-likelihood estimate of the odds ratio, 7.88, rather than the simple cross-product ratio \((9 \times 15)/(8 \times 2) \approx 8.4\).)
Should Badatta publish the results for this age group alone? No:
publishing results for only one age group may not give a representative picture of the overall population;
it can create a misleading impression of the overall findings if results are cherry-picked without the broader context;
results from a single age group may be confounded by other variables.
Even for the 31-35 group the p-value (about 0.03) is only modestly small; if the drug were given to a wider population, the apparent effect might well not hold up.
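Looking now at the other age groups (the subsetting code is analogous to the above and not shown), here are two examples: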
##
## Effective Ineffective
## Kilvir 18 39
## Placebo 24 33
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contAB
## X-squared = 0.94246, df = 1, p-value = 0.3316
##
## Fisher's Exact Test for Count Data
##
## data: contAB
## p-value = 0.3317
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.2737906 1.4623951
## sample estimates:
## odds ratio
## 0.6371842
##
## Effective Ineffective
## Kilvir 4 9
## Placebo 4 9
##
## Pearson's Chi-squared test
##
## data: contNO
## X-squared = 0, df = 1, p-value = 1
##
## Fisher's Exact Test for Count Data
##
## data: contNO
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.1373006 7.2832873
## sample estimates:
## odds ratio
## 1
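Collecting the two p-values for every age group gives the summary below. As a minimal sketch, assuming the ct data frame loaded above, a table like this could be built as follows (the table actually shown was produced by similar, unshown code):
# For each age group, compute the chi-squared and Fisher p-values
# from the drug-vs-result contingency table.
pvals_by_group <- t(sapply(sort(unique(ct$AgeGroup)), function(g) {
  tab <- table(ct$Drug[ct$AgeGroup == g], ct$Result[ct$AgeGroup == g])
  c("p-value (chi-squared)" = suppressWarnings(chisq.test(tab))$p.value,
    "p-value (fisher)"      = fisher.test(tab)$p.value)
}))
pvals_by_group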
## p-value (chi-squared) p-value (fisher)
## 16-20 0.1240666 0.1231997
## 26-30 0.6403202 0.64078
## 31-35 0.02784006 0.02550983
## 36-40 0.3482841 0.3487349
## 41-45 0.7125793 0.713982
## 46-50 1 1
## 51-55 1 1
## 56-60 0.7300697 0.7310941
## 61-65 0.3906743 0.3911247
## 66-70 1 1
## 71-75 0.6932818 0.6945843
## 76-80 0.3990752 0.4003231
## 81-85 0.6170751 0.6199095
## 86-90 0.4142162 0.4285714
All the other age groups (apart from 31-35) have much higher p-values.
In three age groups the p-value is exactly 1: there the data give no indication at all that Kilvir works.
The remaining groups have p-values well above 10%, some above 70%.
Overall, there is insufficient evidence to reject \(H_0\): that the new drug Kilvir is not effective.
Well, it appears ineffective… when considering people of all age groups.