Exercise 5.1

Make a table of the residuals for Siblings (rows) against Hair (columns):

##       data_2
## data_1 Black Blonde Brown  Grey   Red
##      0  1.54  -0.49 -1.66  1.14  0.19
##      1 -1.08   0.44  1.01 -0.80  0.32
##      2 -0.32  -0.32  0.71 -0.37 -0.53
##      3 -0.15  -0.60  0.58 -0.18 -0.25
##      4 -1.61   2.33  0.58 -0.12 -0.16
##      5  0.60  -0.22 -0.50 -0.07 -0.09
##      6  0.60  -0.22 -0.50 -0.07 -0.09
  • In the Hair and Siblings residuals table, how many associations are we examining?
## There are 35 associations (7 rows × 5 columns).
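A quick check (assuming SHtable holds the residuals table shown above):

prod (dim (SHtable))   # 7 rows x 5 columns = 35 cells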
  • Are 95% of them within two standard deviations of the mean?

First I compute the summary values, then write a function to check the percentage:

mu <- mean(SHtable)    # mean of all the residuals
sdev <- sd(SHtable)    # and their standard deviation

lower_bound <- mu - 2 * sdev
upper_bound <- mu + 2 * sdev

within_2sd <- SHtable[SHtable >= lower_bound & SHtable <= upper_bound]

number_within <- length(within_2sd)
perc_within <- number_within / length(SHtable)   # proportion within two SDs

percentage_check <- function (a, b) {
  d <- a - b
  if (d >= 0) {
    cat ("At least 95% of them are within two standard deviations of the mean.")
  } else {
    cat ("**LESS THAN** 95% of them are within two standard deviations of the mean.")
  }
}

Plug in the values from our data:

percentage_check (a = perc_within,
                  b = 0.95)
## **LESS THAN** 95% of them are within two standard deviations of the mean.
  • Exactly what % of a Normal distribution is within two standard deviations of the mean?
pnorm (q = 2, mean = 0, sd = 1) - pnorm (-2, 0, 1)
## [1] 0.9544997
# or

2 * pnorm (q = 2, mean = 0, sd = 1) - 1
## [1] 0.9544997
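As a quick sanity check, we can also approximate this proportion by simulation (a sketch; the exact value is 0.9545):

set.seed (1)            # for reproducibility
z <- rnorm (1e6)        # a large standard Normal sample
mean (abs (z) <= 2)     # proportion within two SDs of the mean; approx. 0.954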

Exercise 5.2

Write a function that takes two variables for the same set of objects as arguments and prints out the test statistic, the degrees of freedom, and the p-value of a \(\chi^2\) test of independence.

Test it against Siblings and Hair.

Do you expect there to be any difference if you change the order of the variables? Run the function against Hair and Fish. Write down your conclusion.

I create a function:

hyp_test <- function (a, b) {
  
  obs_table <- table (a, b) # table of observed values
  
  niplus <- rowSums (obs_table)
  nplusj <- colSums (obs_table)
  n <- sum(obs_table)
  
  exp_table <- niplus %*% t(nplusj) /n # table of expected values
  
  test_stat <- sum(
    (obs_table - exp_table)^2
    /
      exp_table
  )                       # this outputs the test statistic
  
  d <- ( nrow (obs_table) - 1) * (ncol (obs_table) - 1)
  
  p_value <- 1 - pchisq (q = test_stat, df = d)
  p_value <- round (p_value, 3)    # round to 3 d.p.
  
  return (list ("test statistics" = test_stat,
                "degree of freedom" = d,
                "p-value" = p_value))
}
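As a sanity check (a sketch, assuming the qanda data frame is loaded), base R’s chisq.test() computes the same Pearson statistic, so the two should agree:

# should reproduce the same statistic, df and p-value as hyp_test();
# chisq.test() may also warn that the approximation is poor for sparse cells
chisq.test (table (qanda$Siblings, qanda$Hair), correct = FALSE)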
  • Test it against Siblings and Hair
hyp_test (a = qanda$Siblings, b = qanda$Hair)
## $`test statistics`
## [1] 12.03986
## 
## $`degree of freedom`
## [1] 24
## 
## $`p-value`
## [1] 0.979
  • What about changing the order? There is no difference:
hyp_test (a = qanda$Hair, b = qanda$Siblings)
## $`test statistics`
## [1] 12.03986
## 
## $`degree of freedom`
## [1] 24
## 
## $`p-value`
## [1] 0.979

Run the function against Hair and Fish:

hyp_test (a = qanda$Hair, b = qanda$Fish)
  • Conclusion: the p-value for Hair and Siblings is large (0.979), so there is no evidence against independence; Hair colour and Siblings appear to be independent, and swapping the order of the arguments makes no difference.

Exercise 5.3.

The name of the function:

# fisher.test()

Run it against our Siblings and Hair data:

SH_fisher_test <- fisher.test (x = qanda$Siblings,
                               y = qanda$Hair,
                               workspace = 500000,
                               conf.level = 0.95,
                               simulate.p.value = TRUE)

SH_fisher_test
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  qanda$Siblings and qanda$Hair
## p-value = 0.6712
## alternative hypothesis: two.sided

Hair and Fish

HF_fisher_test <- fisher.test (x = qanda$Hair,
                               y = qanda$Fish)
HF_fisher_test
## 
##  Fisher's Exact Test for Count Data
## 
## data:  qanda$Hair and qanda$Fish
## p-value = 0.9108
## alternative hypothesis: two.sided
  • Notes:

  • This function assesses the association between two categorical variables and is particularly suited to small samples.

  • Fisher’s exact test is generally used for datasets with small sample sizes (< 1000), especially when the expected frequency is below 5 in more than 20% of the cells. The chi-square test should be used for datasets with large sample sizes.

  • Fisher’s exact test is an exact test: it returns an exact p-value, the probability of obtaining a table at least as extreme as the observed one under the assumption that the variables are independent.
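A sketch of how one might check the rule of thumb above for the Siblings/Hair table (assuming the qanda data frame is loaded): compute the expected counts under independence and the proportion of cells falling below 5.

obs <- table (qanda$Siblings, qanda$Hair)
exp <- rowSums (obs) %*% t (colSums (obs)) / sum (obs)  # expected counts under independence
mean (exp < 5)   # proportion of cells with expected count below 5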

Exercise 5.4.

The given code:

Hair_short <- rep ("Other", times = nrow (qanda)) # first set every hair colour to "Other"
Hair_short [qanda$Hair == "Black"] <- "Black"     # then restore "Black", leaving two levels

Hair_short <- as.factor (Hair_short)

Siblings_short <- rep (">1", times = nrow (qanda))  # first set every count to ">1"
Siblings_short [qanda$Siblings == 0] <- "0" # then restore those with 0 siblings
Siblings_short [qanda$Siblings == 1] <- "1" # and those with 1 sibling

Siblings_short <- as.factor (Siblings_short)
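An equivalent, more compact construction (a sketch; it assumes qanda$Siblings is numeric and produces the same two factors):

Hair_short     <- factor (ifelse (qanda$Hair == "Black", "Black", "Other"))
Siblings_short <- factor (ifelse (qanda$Siblings >= 2, ">1",
                                  as.character (qanda$Siblings)))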

Let’s take a look at the collapsed table:

##               Hair_short
## Siblings_short Black Other
##             >1    28    12
##             0     79    21
##             1     63    27

The original table:

##    
##     Black Blonde Brown Grey Red
##   0    79      4    15    1   1
##   1    63      5    21    0   1
##   2    20      1     7    0   0
##   3     5      0     2    0   0
##   4     1      1     1    0   0
##   5     1      0     0    0   0
##   6     1      0     0    0   0

Run the function for this new table

hyp_test (a = Hair_short, b = Siblings_short)
## $`test statistics`
## [1] 2.374412
## 
## $`degree of freedom`
## [1] 2
## 
## $`p-value`
## [1] 0.305

Compare with the original table and with Fisher’s exact test. All three p-values are large, so none of the approaches finds evidence of an association; collapsing the table also removes the sparse cells that make the \(\chi^2\) approximation unreliable.

The original table

hyp_test (a = qanda$Hair, b = qanda$Siblings)
## $`test statistics`
## [1] 12.03986
## 
## $`degree of freedom`
## [1] 24
## 
## $`p-value`
## [1] 0.979

Fisher’s exact test:

fisher.test (x = Hair_short, y = Siblings_short)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  Hair_short and Siblings_short
## p-value = 0.2876
## alternative hypothesis: two.sided

Exercise 5.5.

General process

The following is a six-step general process for hypothesis testing and p-values:

  1. Define your hypotheses: null, \(H_0\), and alternative, \(H_1\)

  2. Specify the statistical test under \(H_0\)

  3. State and check any assumptions you are making

  4. Calculate the test statistic to be used in the test

  5. Work out the p-value corresponding to the test statistic

  6. Draw conclusions in terms of the wording of \(H_0\)

1. Define the hypotheses: \(H_0\) and \(H_1\)

\(H_0\): The Geometric (0.5) model is a good fit.

\(H_1\): The Geometric (0.5) model is not a good fit.

2. Specify the statistical test under \(H_0\)

Statistical test:

  • \(\chi^2\) test

3. State and check any assumptions you are making

  • The probability of a female birth is 0.5;

  • We will ignore multiple births.

4. Calculate the test statistic to be used in the test

We first build a table of the observed and expected counts:

col_names <- c ("Observed", "Expected")
row_names <- c ("0", "1", "2", "3", ">=4")

Observed <- c (100, 90, 28, 7, 5)

# Expected counts under a Geometric(0.5) model:
# n * P(X = k) for k = 0, ..., 3, and n * P(X >= 4) for the pooled final cell
Expected0to3 <- length (qanda$Siblings) * dgeom (x = 0:3, prob = 0.5)
Expected4up <- length (qanda$Siblings) * pgeom (q = 3, prob = 0.5, lower.tail = FALSE)

matrix_data <- matrix (c (Observed, Expected0to3, Expected4up),
                       nrow = 2, byrow = TRUE)

matrix_data <- round (matrix_data, 1)

T <- as.data.frame (t (matrix_data))  # note: T also abbreviates TRUE in R, so a longer name would be safer

rownames (T) <- row_names
colnames (T) <- col_names

T
##     Observed Expected
## 0        100    115.0
## 1         90     57.5
## 2         28     28.8
## 3          7     14.4
## >=4        5     14.4
  • The test statistic is \(\chi^2 = \sum_i (O_i - E_i)^2 / E_i\)

  • We compute it with a function:

gof_test <- function (a) {   # goodness-of-fit version, taking the observed/expected table
  O <- a$Observed
  E <- a$Expected

  C <- ((O - E)^2) / E

  chi2 <- sum (C)

  D <- length (O) - 1    # 5 categories with p fixed at 0.5, so df = 5 - 1 = 4

  p_value <- 1 - pchisq (q = chi2, df = D)

  return (list ("test statistics" = chi2,
                "degree of freedom" = D,
                "p-value" = p_value))
}

5. Work out the p-value corresponding to the test statistic

gof_test (a = T)
## $`test statistics`
## [1] 30.2872
## 
## $`degree of freedom`
## [1] 4
## 
## $`p-value`
## [1] 4.277784e-06

6.a Draw conclusions in terms of the wording of \(H_0\)

How do we interpret this at a 5% significance level?

qchisq(p = 0.95, df = 4)
## [1] 9.487729
  • The critical value is 9.487729.

  • Since 30.2872 > 9.487729, there is sufficient evidence to reject \(H_0\).

  • That is, there is sufficient evidence to state that a Geometric(0.5) distribution is not a good model.

6.b Draw conclusions in terms of the wording of \(H_0\)

Or, we can use the p-value: 4.277784e-06

(The p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data, under the assumption that the null hypothesis is true.)

  • Since this p-value is far below 0.05, the observed data would be very unlikely if the null hypothesis were true.

  • Therefore, we have sufficient evidence to reject the null hypothesis at the 0.05 significance level.

  • Conclusion: the Geometric(0.5) distribution is not a good model.

How can we use chisq.test() to perform a goodness-of-fit test?

chisq.test(T$Observed, p = T$Expected / sum(T$Expected))
## 
##  Chi-squared test for given probabilities
## 
## data:  T$Observed
## X-squared = 30.3, df = 4, p-value = 4.252e-06
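The expected counts stored in T were rounded to one decimal place. A variant (a sketch) passes the exact Geometric(0.5) cell probabilities to chisq.test() instead:

p_geom <- c (dgeom (0:3, prob = 0.5),
             pgeom (3, prob = 0.5, lower.tail = FALSE))  # P(X >= 4) for the pooled cell
chisq.test (Observed, p = p_geom)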

Have we proved that parents stop having children after the first girl?

  • Not quite. In the previous part we showed only that the Geometric(0.5) distribution is not a good model.

What does this mean?

  • Rejecting the model does not establish any particular mechanism: family planning decisions might be influenced by factors other than the sex of the child.

Exercise 5.6

Download the clinical_trials.csv data set. In this synthetic data set, a company Badatta claims that they have developed a new drug Kilvir that can kill the common cold virus and have conducted clinical trials to prove this claim.

Here, each row in the data set represents data for one patient: their patient ID, age group, treatment drug (Kilvir or Placebo), and treatment result.

  1. Extract a subset of the data containing only patients in the 31-35 age group.

  2. Construct a contingency table of treatment drug versus treatment effect for this age group.

  3. Is there evidence that Kilvir is effective in this age group?

Now have a look at the other age groups.

  1. Will you be concerned if Badatta only publishes their clinical trial results in the 31-35 age group?

Download the dataset:

ct <- read.csv ("clinical_trials.csv")

Take a look at it:

head (ct)
##   PatientID AgeGroup    Drug      Result
## 1        P1    56-60  Kilvir   Effective
## 2        P2    26-30 Placebo   Effective
## 3        P3    61-65 Placebo Ineffective
## 4        P4    61-65  Kilvir   Effective
## 5        P5    16-20 Placebo   Effective
## 6        P6    81-85 Placebo   Effective

What should be my null hypothesis?

  • \(H_0\): the new drug Kilvir is not effective, i.e. the treatment drug and the result are independent.

  • \(H_1\): the new drug Kilvir is effective, i.e. the drug and the result are associated.

  • Usually, we put what we want to prove in the alternative hypothesis.

Extract age group: 31-35:

Here are the first five age groups:

table (ct$AgeGroup) [1:5]
## 
## 16-20 21-25 26-30 31-35 36-40 
##    68    46    84    34    52
  • Age group 31-35 is the fourth group, so I label it D.
ctrowD <- which(ct$AgeGroup == "31-35")

Construct a contingency table of treatment drug versus treatment effect for this age group.

  • I first make an individual table for patients aged 31-35.
dfD <- ct [ ctrowD, ]

contD <- table (dfD$Drug, dfD$Result)

contD
##          
##           Effective Ineffective
##   Kilvir          9           8
##   Placebo         2          15
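Row proportions give a quick first look at the contrast between the two drugs:

prop.table (contD, margin = 1)   # within-drug proportions of Effective / Ineffective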

Is there evidence that Kilvir is effective in this age group?

  • Is the type of drug independent of the result?
chisq.test(contD)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contD
## X-squared = 4.8379, df = 1, p-value = 0.02784
fisher.test(contD)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contD
## p-value = 0.02551
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   1.217552 92.482179
## sample estimates:
## odds ratio 
##   7.882191

At the 5% significance level, both p-values (0.028 and 0.026) fall below 0.05, so there is sufficient evidence to reject the hypothesis that the type of drug and the overall effect (effective / ineffective) are unrelated.

But at the stricter 1% significance level, there is insufficient evidence to reject \(H_0\).

Odds ratio?

  • An odds ratio greater than 1 indicates a positive association.

  • An odds ratio of 7.88 indicates a strong positive relationship.

  • Here the drug is the explanatory variable and the result (effective / ineffective) is the response.

  • The estimate says that, in this age group, the odds of an effective outcome are about 8 times higher with Kilvir than with the placebo (not that Kilvir is "8 times more effective").
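The sample (cross-product) odds ratio can be computed directly from the table: (9 / 8) / (2 / 15) ≈ 8.44. fisher.test() reports a conditional maximum-likelihood estimate instead, which is why its 7.88 differs slightly:

(contD [1, 1] / contD [1, 2]) / (contD [2, 1] / contD [2, 2])   # (9/8) / (2/15) = 8.4375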

Is it okay to publish only the results for the 31-35 age group?

No.

  • Publishing results for only one age group may not give a representative picture of the overall population;

  • it can give a misleading impression of the overall findings if results are cherry-picked without the broader context (a rough calculation below shows how easily this happens);

  • results from one age group may be confounded by other variables;

  • and even in the 31-35 group the p-value is only just below 0.05, so if Kilvir were rolled out to a larger population on this evidence, problems could well emerge.
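A rough back-of-envelope calculation quantifies the cherry-picking concern (a sketch, assuming the 14 age-group tests are independent): even if the drug does nothing at all, at least one group is quite likely to come out "significant" at the 5% level.

# P(at least one p-value < 0.05 across 14 independent tests under H0)
1 - (1 - 0.05)^14   # approximately 0.51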

A further look: the other age groups

  • How does Kilvir perform in the rest of the data?

The youngest two age groups combined:

##          
##           Effective Ineffective
##   Kilvir         18          39
##   Placebo        24          33

Perform two tests:

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contAB
## X-squared = 0.94246, df = 1, p-value = 0.3316
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contAB
## p-value = 0.3317
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.2737906 1.4623951
## sample estimates:
## odds ratio 
##  0.6371842

The eldest two age groups combined:

##          
##           Effective Ineffective
##   Kilvir          4           9
##   Placebo         4           9

Perform two tests:

## 
##  Pearson's Chi-squared test
## 
## data:  contNO
## X-squared = 0, df = 1, p-value = 1
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contNO
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1373006 7.2832873
## sample estimates:
## odds ratio 
##          1

Let’s see all age groups together

##       p-value (chi-squared) p-value (fisher)
## 16-20 0.1240666             0.1231997       
## 26-30 0.6403202             0.64078         
## 31-35 0.02784006            0.02550983      
## 36-40 0.3482841             0.3487349       
## 41-45 0.7125793             0.713982        
## 46-50 1                     1               
## 51-55 1                     1               
## 56-60 0.7300697             0.7310941       
## 61-65 0.3906743             0.3911247       
## 66-70 1                     1               
## 71-75 0.6932818             0.6945843       
## 76-80 0.3990752             0.4003231       
## 81-85 0.6170751             0.6199095       
## 86-90 0.4142162             0.4285714
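A sketch of how such a table can be assembled, looping over every age group present in ct (it assumes ct as loaded above; chisq.test() may warn that the approximation is poor for sparse tables):

groups <- sort (unique (ct$AgeGroup))
pvals <- t (sapply (groups, function (g) {
  tab <- table (ct$Drug [ct$AgeGroup == g], ct$Result [ct$AgeGroup == g])
  c ("p-value (chi-squared)" = chisq.test (tab)$p.value,
     "p-value (fisher)" = fisher.test (tab)$p.value)
}))
rownames (pvals) <- groups
pvals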

Conclusion

Only one of the fourteen age groups shown (31-35) has a p-value below 0.05, which is roughly what we would expect by chance when testing this many groups even if Kilvir had no effect. Taken as a whole, the clinical trial data do not support Badatta’s claim, and publishing only the 31-35 results would be cherry-picking.