2024-02-07

Exercise 5.1

  • In the Hair and Siblings residuals table, how many associations are we examining?

  • Are 95% of them within two standard deviations of the mean?

  • How about for the other comparisons you carried out in Exercise 2.5?

  • Exactly what % of a Normal distribution is within two standard deviations of the mean?

Make the residuals table (rows: number of Siblings, columns: Hair colour):

##       data_2
## data_1 Black Blonde Brown  Grey   Red
##      0  1.54  -0.49 -1.66  1.14  0.19
##      1 -1.08   0.44  1.01 -0.80  0.32
##      2 -0.32  -0.32  0.71 -0.37 -0.53
##      3 -0.15  -0.60  0.58 -0.18 -0.25
##      4 -1.61   2.33  0.58 -0.12 -0.16
##      5  0.60  -0.22 -0.50 -0.07 -0.09
##      6  0.60  -0.22 -0.50 -0.07 -0.09
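The table above might have been produced along the following lines (a minimal sketch, assuming the survey responses live in a data frame qanda with Siblings and Hair columns, as used later):

obs <- table (qanda$Siblings, qanda$Hair)               # observed counts
exp <- rowSums (obs) %*% t (colSums (obs)) / sum (obs)  # expected counts under independence
SHtable <- (obs - exp) / sqrt (exp)                     # Pearson (standardised) residuals
round (SHtable, 2)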

In the Hair and Siblings residuals table, how many associations are we examining?

## There are 35 associations: one for each of the 7 x 5 = 35 cells.

Are 95% of them within two standard deviations of the mean?

I first compute the summary quantities, then write a small checking function:

mu <- mean (SHtable)    # mean of all residuals

sdev <- sd (SHtable)    # standard deviation of all residuals

lower_bound <- mu - 2 * sdev
upper_bound <- mu + 2 * sdev

within_2sd <- SHtable [SHtable >= lower_bound & SHtable <= upper_bound]

number_within <- length (within_2sd)
perc_within <- number_within / length (SHtable)   # proportion within two sd

percentage_check <- function (a, b) {
  d <- a - b
  if (d >= 0) {
    cat ("At least 95 % of them are within two standard deviations of the mean.")
  } else {
    cat ("**LESS THAN** 95 % of them are within two standard deviations of the mean.")
  }
}

Plug in the values from our data:

percentage_check (a = perc_within,
                  b = 0.95)
## **LESS THAN** 95 % of them are within two standard deviations of the mean.

Exactly what % of a Normal distribution is within two standard deviations of the mean?

pnorm (q = 2, mean = 0, sd = 1) - pnorm (-2, 0, 1)
## [1] 0.9544997
# or, equivalently, using the symmetry of the Normal distribution

2 * pnorm (q = 2, mean = 0, sd = 1) - 1
## [1] 0.9544997

So about 95.45% of a Normal distribution lies within two standard deviations of the mean, slightly more than the 95% rule of thumb.

Exercise 5.2

Write a function that takes two variables for the same set of objects as arguments and prints out the:

  • test statistic,

  • number of degrees of freedom and

  • p-value (to three decimal places where appropriate).

Test it against Siblings and Hair.

Do you expect there to be any difference if you change the order of the variables? Run the function against Hair and Fish. Write down your conclusion.

I create a function:

hyp_test <- function (a, b) {
  
  obs_table <- table (a, b) # table of observed values
  
  niplus <- rowSums (obs_table)
  nplusj <- colSums (obs_table)
  n <- sum(obs_table)
  
  exp_table <- niplus %*% t(nplusj) /n # table of expected values
  
  test_stat <- sum ((obs_table - exp_table)^2 / exp_table)   # the chi-squared test statistic
  
  d <- ( nrow (obs_table) - 1) * (ncol (obs_table) - 1)
  
  p_value <- 1 - pchisq (q = test_stat, df = d)
  p_value <- round (p_value, 3)    # round to 3 d.p.
  
  return (list ("test statistics" = test_stat,
                "degree of freedom" = d,
                "p-value" = p_value))
}

Test it against Siblings and Hair

hyp_test (a = qanda$Siblings, b = qanda$Hair)
## $`test statistic`
## [1] 12.03986
## 
## $`degrees of freedom`
## [1] 24
## 
## $`p-value`
## [1] 0.979
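As a quick cross-check (an addition, not part of the exercise), the built-in chisq.test should reproduce the statistic, degrees of freedom and p-value (up to our rounding to 3 d.p.); correct = FALSE disables the continuity correction, although R only applies that to 2 x 2 tables anyway:

chisq.test (table (qanda$Siblings, qanda$Hair), correct = FALSE)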

What about changing the order? There should be no difference: the \(\chi^2\) statistic is symmetric in the two variables.

hyp_test (a = qanda$Hair, b = qanda$Siblings)
## $`test statistic`
## [1] 12.03986
## 
## $`degrees of freedom`
## [1] 24
## 
## $`p-value`
## [1] 0.979

Run the function against Hair and Fish. (Note that the call below actually uses Handspan rather than Hair; the intended call is shown after the output.)

hyp_test (a = qanda$Handspan, b = qanda$Fish)
## $`test statistic`
## [1] 88.10786
## 
## $`degrees of freedom`
## [1] 103
## 
## $`p-value`
## [1] 0.852
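For the record, the call the exercise actually asks for would be the following (its output is not reproduced here):

hyp_test (a = qanda$Hair, b = qanda$Fish)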

Conclusion:

  • Both p-values are large (0.979 and 0.852), so in each case there is insufficient evidence to reject \(H_0\): the data are consistent with the variables being independent.
  • In particular, we cannot reject independence of Hair colour and Siblings.
  • Note, however, that many expected counts in the Hair and Siblings table are below 5, so the \(\chi^2\) approximation is questionable here; this motivates Fisher’s exact test in Exercise 5.3.

Exercise 5.3.

  • Find out the name of the function in R for Fisher’s exact test.

  • Make sure that you understand what the test does and then run it against our Siblings and Hair data.

  • Compare the results from this test with those from the \(\chi^2\) test we performed above.

  • Do the same for Hair and Fish as in Exercise 5.2.

Name of the function

# fisher.test()

Run it against our Siblings and Hair data:

SH_fisher_test <- fisher.test (x = qanda$Siblings,
                               y = qanda$Hair,
                               workspace = 500000,
                               conf.level = 0.95,
                               simulate.p.value = TRUE) # simulate the p-value: the exact computation is heavy for a 7 x 5 table

SH_fisher_test
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  qanda$Siblings and qanda$Hair
## p-value = 0.6647
## alternative hypothesis: two.sided

Hair and Fish

HF_fisher_test <- fisher.test (x = qanda$Hair,
                               y = qanda$Fish)
HF_fisher_test
## 
##  Fisher's Exact Test for Count Data
## 
## data:  qanda$Hair and qanda$Fish
## p-value = 0.9108
## alternative hypothesis: two.sided

Notes:

  • This function assesses the association between two categorical variables and is especially useful when the sample size is small.
  • Fisher’s exact test is generally used for small data sets (sample size < 1000), especially when more than 20% of cells have expected counts below 5 (a quick check of this rule for our table follows below). The chi-squared test is preferred for large samples, where the exact computation becomes expensive.
  • Fisher’s exact test is exact: it returns the exact probability of obtaining a table at least as extreme as the one observed, assuming that the variables are independent.
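A small sketch of how that rule of thumb could be checked for the Siblings and Hair table (assuming qanda as before):

obs <- table (qanda$Siblings, qanda$Hair)
exp <- rowSums (obs) %*% t (colSums (obs)) / sum (obs)  # expected counts under independence
mean (exp < 5)   # proportion of cells with expected count below 5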

Exercise 5.4.

  • Run the code given above and check that the new short vectors of Siblings and Hair are correct summaries of the original vectors.

  • Run the function you created in Exercise 5.2 against these new vectors.

  • What can you conclude about the relationship between the two variables now?

  • Compare the result with that from the original test and from Fisher’s exact test.

The given code:

Hair_short <- rep ("Other", times = nrow (qanda)) # I first replace all hair color to Other
Hair_short [qanda$Hair == "Black"] <- "Black" # Then I retain black color only.

Hair_short <- as.factor (Hair_short)

Siblings_short <- rep (">1", times = nrow (qanda))  # I first replace all numbers with > 1
Siblings_short [qanda$Siblings == 0] <- "0" # I retain those with 0 Siblings
Siblings_short [qanda$Siblings == 1] <- "1" # I also retain those with 1 sibling

Siblings_short <- as.factor (Siblings_short)
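To check that the short vectors really are correct summaries of the originals (the first part of the exercise), the retained category counts should be unchanged; a minimal sketch:

sum (Hair_short == "Black") == sum (qanda$Hair == "Black")   # Black count preserved?
all (table (Siblings_short)[c("0", "1")] == table (qanda$Siblings)[c("0", "1")])   # 0 and 1 counts preserved?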

Let’s take a look at the updated table, table (Siblings_short, Hair_short):

##               Hair_short
## Siblings_short Black Other
##             >1    28    12
##             0     79    21
##             1     63    27

The original table:

##    
##     Black Blonde Brown Grey Red
##   0    79      4    15    1   1
##   1    63      5    21    0   1
##   2    20      1     7    0   0
##   3     5      0     2    0   0
##   4     1      1     1    0   0
##   5     1      0     0    0   0
##   6     1      0     0    0   0

Run the function for this new table

hyp_test (a = Hair_short, b = Siblings_short)
## $`test statistic`
## [1] 2.374412
## 
## $`degrees of freedom`
## [1] 2
## 
## $`p-value`
## [1] 0.305

Compare with the original test and with Fisher’s exact test.

The original test:

hyp_test (a = qanda$Hair, b = qanda$Siblings)
## $`test statistic`
## [1] 12.03986
## 
## $`degrees of freedom`
## [1] 24
## 
## $`p-value`
## [1] 0.979

Fisher’s exact test on the collapsed table:

fisher.test (x = Hair_short, y = Siblings_short)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  Hair_short and Siblings_short
## p-value = 0.2876
## alternative hypothesis: two.sided
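Conclusion: all three p-values (0.305, 0.979 and 0.288) are well above 0.05, so there is still no evidence of an association between hair colour and number of siblings. The benefit of collapsing is that every expected count is now comfortably above 5, so the \(\chi^2\) approximation is trustworthy; notice how its p-value is now much closer to Fisher’s exact result.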

Exercise 5.5.

  • Work through the general process for hypothesis testing given in Section 5.1 using the data in Table 5.1.

  • Create a function which takes such a table (of any number of columns) as input and produces the same output as Exercise 5.2.

  • Work out how to use chisq.test() to perform a goodness of fit test and check the answer from that with the output from your function.

  • Have we proved that parents stop having children after the first girl? Why?

General process

The following is a six-step general process for hypothesis testing and p-values:

  1. Define your hypotheses: null, \(H_0\), and alternative, \(H_1\)

  2. Specify the statistical test under \(H_0\)

  3. State and check any assumptions you are making

  4. Calculate the test statistic to be used in the test

  5. Work out the p-value corresponding to the test statistic

  6. Draw conclusions in terms of the wording of \(H_0\)

1. Define the hypotheses: \(H_0\) and \(H_1\)

\(H_0\): The Geometric (0.5) model is a good fit.

\(H_1\): The Geometric (0.5) model is not a good fit.

2. Specify the statistical test under \(H_0\)

Statistical test:

  • \(\chi^2\) goodness-of-fit test

3. State and check any assumptions you are making

  • The probability of a female birth is 0.5;

  • We will ignore multiple births.

4. Calculate the test statistic to be used in the test

We first look at the table of observed and expected counts:

##     Observed Expected
## 0        100    115.0
## 1         90     57.5
## 2         28     28.8
## 3          7     14.4
## >=4        5     14.4
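A quick sketch of where the Expected column comes from, assuming 230 families in total and a Geometric(0.5) model for the number of boys before the first girl:

n <- 230                            # total number of families (the sum of Observed)
p <- dgeom (0:3, prob = 0.5)        # P(X = 0), ..., P(X = 3)
expected <- n * c (p, 1 - sum (p))  # lump the remaining probability into ">= 4"
round (expected, 1)                 # 115.0 57.5 28.8 14.4 14.4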

We need the test statistic \(\sum_i (O_i - E_i)^2 / E_i\). I write a function that takes the whole table as input (named gof_test here so that it does not overwrite hyp_test from Exercise 5.2):
gof_test <- function (a) {
  O <- a$Observed
  E <- a$Expected
  
  C <- ((O - E)^2) / E    # per-category contributions
  
  chi2 <- sum (C)         # the test statistic
  
  D <- length (O) - 1     # degrees of freedom: number of categories minus one
  
  p_value <- 1 - pchisq (q = chi2, df = D)
  
  return (list ("test statistic" = chi2,
                "degrees of freedom" = D,
                "p-value" = p_value))
}

5. Work out the p-value corresponding to the test statistic

gof_test (a = T)   # T is the data frame behind the table above (its name masks R's built-in alias T for TRUE)
## $`test statistic`
## [1] 30.2872
## 
## $`degrees of freedom`
## [1] 4
## 
## $`p-value`
## [1] 4.277784e-06

6. Draw conclusions in terms of the wording of \(H_0\)

How do we interpret this? Suppose we use a 5% significance level.

qchisq(p = 0.95, df = 4)
## [1] 9.487729
  • The critical value is 9.487729.
  • Since 30.2872 > 9.487729, the test statistic lies in the rejection region.
  • There is sufficient evidence to reject \(H_0\): a Geometric(0.5) distribution is not a good model.

Alternatively, we can use the p-value: 4.277784e-06.

(The p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data, under the assumption that the null hypothesis is true.)

  • This means the observed data would be very unlikely if the null hypothesis were true.
  • Therefore, we have sufficient evidence to reject the null hypothesis at the 0.05 significance level.

Conclusion: the Geometric(0.5) distribution is not a good model.

How to use chisq.test() to perform a goodness of fit test?

chisq.test(T$Observed, p = T$Expected / sum(T$Expected))  # goodness of fit: observed counts plus hypothesised probabilities
## 
##  Chi-squared test for given probabilities
## 
## data:  T$Observed
## X-squared = 30.3, df = 4, p-value = 4.252e-06
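The statistic and p-value differ very slightly from our function's output (30.2872 and 4.278e-06 versus 30.3 and 4.252e-06) because the Expected column is rounded to one decimal place and so sums to 230.1 rather than 230; the two calculations therefore use very slightly different expected counts.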

Have we proved that parents stop having children after the first girl?

  • In the previous part we showed that a Geometric(0.5) distribution is not a good model for these data.
  • What does this mean? Rejecting one model does not prove any particular alternative: it only tells us that this model fits poorly, not why.
  • Family-size decisions might be influenced by many factors other than the sex of the children, so we have certainly not proved the stopping rule.

Exercise 5.6

Download the clinical_trials.csv data set. In this synthetic data set, a company called Badatta claims that it has developed a new drug, Kilvir, that can kill the common cold virus, and that it has conducted clinical trials proving this claim.

Here, each row in the data set represents data for one patient, including their

  • age group,

  • whether they have been treated with the Kilvir drug or a placebo (which has no therapeutic effect), and

  • whether the patient feels the drug has effectively improved their condition.

Exercise 5.6

  1. Extract a subset of the data containing only patients in the 31-35 age group.

  2. Construct a contingency table of treatment drug versus treatment effect for this age group.

  3. Is there evidence that Kilvir is effective in this age group?

Now have a look at the other age groups.

  4. Will you be concerned if Badatta only publishes their clinical trial results in the 31-35 age group?

Download the dataset:

ct <- read.csv ("clinical_trials.csv")

Take a look at it:

head (ct)
##   PatientID AgeGroup    Drug      Result
## 1        P1    56-60  Kilvir   Effective
## 2        P2    26-30 Placebo   Effective
## 3        P3    61-65 Placebo Ineffective
## 4        P4    61-65  Kilvir   Effective
## 5        P5    16-20 Placebo   Effective
## 6        P6    81-85 Placebo   Effective

What should be my null hypothesis?

  • \(H_0\): the new drug Kilvir is not effective.
  • \(H_1\): the new drug Kilvir is effective.
  • By convention, we put the claim we want to establish in the alternative hypothesis.

Extract age group: 31-35:

Here are the first five age groups:

table (ct$AgeGroup) [1:5]
## 
## 16-20 21-25 26-30 31-35 36-40 
##    68    46    84    34    52
  • Age group 31-35 is the fourth group shown; I label this subset D.
ctrowD <- which(ct$AgeGroup == "31-35")

Construct a contingency table of treatment drug versus treatment effect for this age group.

  • I first subset the rows for patients aged 31-35.
dfD <- ct [ ctrowD, ]

contD <- table (dfD$Drug, dfD$Result)

contD
##          
##           Effective Ineffective
##   Kilvir          9           8
##   Placebo         2          15

Is Kilvir effective in this age group?

chisq.test(contD)$p.value
## [1] 0.02784006
fisher.test(contD)$p.value
## [1] 0.02550983

At a 5% significance level, there is sufficient evidence to reject \(H_0\): the data suggest that the new drug Kilvir is effective in this age group.

But at a stricter 1% significance level (not unusual when testing drugs), there is insufficient evidence to reject \(H_0\), so we cannot conclude that it is effective.

The odds ratio (as reported by fisher.test (contD)$estimate):

## odds ratio 
##   7.882191
  • An odds ratio greater than 1 indicates a positive association.
  • An odds ratio of 7.88 indicates a strong positive relationship.
  • Here Drug is the explanatory variable and Result is the response: given which drug a patient received, how likely is an effective outcome?
  • The estimate says that the odds of an effective result are about 8 times higher with Kilvir than with the placebo in this age group (see the hand computation below).
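For comparison, the raw cross-product odds ratio from contD can be computed by hand; as an aside, fisher.test reports the conditional maximum likelihood estimate rather than this sample estimate, which is why the two differ slightly:

(9 * 15) / (8 * 2)   # sample odds ratio from contD: 8.4375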

Is it okay if Badatta only publishes the results for 31-35-year-olds?

  • No.
  • Publishing results for only one age group may not give a representative picture of the overall population.
  • It can give a misleading impression of the overall findings if results are cherry-picked without the broader context.
  • Results from one age group may be confounded by other variables.
  • Even the published 31-35 p-value (0.028) is not very small, and with many age groups tested, one nominally significant result can easily arise by chance.

A further look: other age groups

  • How do the other age groups perform?

The youngest two age groups combined:

##          
##           Effective Ineffective
##   Kilvir         18          39
##   Placebo        24          33
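The table above was presumably built along these lines (an assumption: the two youngest groups are 16-20 and 21-25, whose 68 + 46 = 114 patients match this table's total):

dfAB <- ct [ct$AgeGroup %in% c("16-20", "21-25"), ]   # two youngest groups
contAB <- table (dfAB$Drug, dfAB$Result)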

Perform two tests:

chisq.test(contAB)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contAB
## X-squared = 0.94246, df = 1, p-value = 0.3316

And Fisher’s exact test:

fisher.test(contAB)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contAB
## p-value = 0.3317
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.2737906 1.4623951
## sample estimates:
## odds ratio 
##  0.6371842

The eldest two age groups combined:

##          
##           Effective Ineffective
##   Kilvir          4           9
##   Placebo         4           9
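Similarly, a sketch for contNO (an assumption: the two eldest groups are 81-85 and 86-90; this table's total is 26 patients):

dfNO <- ct [ct$AgeGroup %in% c("81-85", "86-90"), ]   # two eldest groups
contNO <- table (dfNO$Drug, dfNO$Result)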

Perform two tests:

chisq.test(contNO)
## 
##  Pearson's Chi-squared test
## 
## data:  contNO
## X-squared = 0, df = 1, p-value = 1

And Fisher’s exact test:

fisher.test(contNO)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contNO
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1373006 7.2832873
## sample estimates:
## odds ratio 
##          1

Let’s see all age groups together

test <- function(age){
  
  ctrowi <- which(ct$AgeGroup == age)     # rows belonging to this age group
  
  dfi <- ct [ ctrowi, ]                   # subset the data
  conti <- table (dfi$Drug, dfi$Result)   # drug vs result contingency table
  
  chi <- chisq.test(conti)
  fisher <- fisher.test(conti)
  
  return (list ("p-value (chi-squared)" = chi$p.value,
                "p-value (fisher)" = fisher$p.value)
  )
}

age.index <- c ("16-20", "26-30", "31-35", "36-40",
                "41-45", "46-50", "51-55", "56-60",  
                "61-65", "66-70", "71-75", "76-80", 
                "81-85", "86-90")  # different ages

p.vec <- sapply(age.index, test)  
     # apply chi-squared test for each age group

The resulting p-values for each age group:

##       p-value (chi-squared) p-value (fisher)
## 16-20 0.1240666             0.1231997       
## 26-30 0.6403202             0.64078         
## 31-35 0.02784006            0.02550983      
## 36-40 0.3482841             0.3487349       
## 41-45 0.7125793             0.713982        
## 46-50 1                     1               
## 51-55 1                     1               
## 56-60 0.7300697             0.7310941       
## 61-65 0.3906743             0.3911247       
## 66-70 1                     1               
## 71-75 0.6932818             0.6945843       
## 76-80 0.3990752             0.4003231       
## 81-85 0.6170751             0.6199095       
## 86-90 0.4142162             0.4285714
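Since fourteen separate tests were run, a multiple-testing adjustment is worth a look (an addition, relevant to the cherry-picking concern above); a sketch assuming p.vec is the 2 x 14 list-matrix returned by sapply above:

p.chi <- unlist (p.vec["p-value (chi-squared)", ])   # chi-squared p-values by age group
p.adjust (p.chi, method = "holm")                    # Holm-adjusted p-values

Even the smallest adjusted p-value (0.028 x 14 = 0.39 under Holm) is nowhere near 5%.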

Conclusion

  • Every other age group (apart from 31-35) has a much higher p-value.
  • In three age groups (46-50, 51-55 and 66-70) the p-value is 1: no evidence of any effect at all.
  • The remaining groups have p-values well above 10%, some around 70%.
  • There is insufficient evidence to reject \(H_0\): we cannot conclude that the new drug Kilvir is effective.
  • In short, considering people of all age groups together, the trial gives no convincing evidence that Kilvir works.