GNBF5030 Homework3

Due date: 30/11/2025 (Sunday)

How to create the *.pdf file?

Step 1: Knit this *.rmd file as an html file

Step 2: Open the html file with your web browser

Step 3: From your web browser, save it as a *.pdf file.

Question 1: Hypothesis testing

The gene expression data collected by Golub et al. (1999) are among the classic datasets in bioinformatics. The data are stored in golub.txt, containing gene expression values of 3051 genes (rows) from 38 leukemia patients (columns). Twenty-seven patients (columns 1 to 27) were diagnosed with acute lymphoblastic leukemia (ALL) and eleven (columns 28 to 38) with acute myeloid leukemia (AML). The tumor class of ALL is 0 (negative), while the tumor class of AML is 1 (positive).

The important gene CD33 is one of the investigated genes. Its expression values are in row 808 of the golub data. Suppose that normality of the ALL and AML expression values has been validated, and assume equal variances. Test the equality of the means for gene CD33 using an appropriate test. State the null hypothesis, the p-value, and your conclusion.

# Read the data file 
golub <- read.table("golub.txt", header = TRUE, sep = "\t")

# Extract row 808 corresponding to gene CD33
cd33 <- as.numeric(golub[808, ])

# Split into ALL and AML 
all_vals <- cd33[1:27]
aml_vals <- cd33[28:38]

# Perform two-sample t-test assuming equal variances
t_res <- t.test(all_vals, aml_vals, var.equal = TRUE)

# Display the test result
t_res

## 
##  Two Sample t-test
## 
## data:  all_vals and aml_vals
## t = -7.9813, df = 36, p-value = 1.773e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.5487898 -0.9211602
## sample estimates:
##  mean of x  mean of y 
## -0.8812041  0.3537709

Null hypothesis (H₀):
The mean expression level of gene CD33 is the same between ALL and AML patients.
\(H_0: \mu_{\text{ALL}} = \mu_{\text{AML}}\)

Alternative hypothesis (H₁): The mean expression levels differ.
\(H_1: \mu_{\text{ALL}} \ne \mu_{\text{AML}}\)

Using a two-sample t-test with equal variances, we obtained:

t-statistic: -7.981
df: 36
p-value: 1.77 × 10⁻⁹
mean(ALL): -0.8812
mean(AML): 0.3538
95% CI for the difference in means (ALL minus AML): [-1.549, -0.921]

Conclusion:
Because the p-value is far below 0.05, we reject the null hypothesis.
CD33 expression differs significantly between ALL and AML patients, with a higher mean expression in AML (0.3538) than in ALL (-0.8812).
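As a sanity check, the reported p-value and confidence interval can be recomputed from the summary values printed by `t.test()`. This is a minimal sketch that does not need golub.txt: the numbers are copied from the output above, and the variable names (`t_stat`, `dof`, `se`, and so on) are only illustrative.

```r
# Summary values copied from the t.test() output above
t_stat   <- -7.9813
dof      <- 36            # n_ALL + n_AML - 2 = 27 + 11 - 2
mean_all <- -0.8812041
mean_aml <-  0.3537709

# Two-sided p-value from the t distribution with 36 degrees of freedom
p_val <- 2 * pt(-abs(t_stat), dof)

# Standard error implied by t = (difference in means) / SE
diff_means <- mean_all - mean_aml
se <- diff_means / t_stat

# 95% confidence interval for the difference in means (ALL minus AML)
ci <- diff_means + c(-1, 1) * qt(0.975, dof) * se

p_val
ci
```

Up to rounding of the copied inputs, `p_val` and `ci` should agree with the p-value and the 95 percent confidence interval in the test output.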

Question 2: Categorical Analysis

Researchers conducted a study to investigate the relationship between a specific genetic mutation (A) and susceptibility to a rare neurological disorder (N). They collected data from a cohort of 150 individuals and summarized it in a 2x2 contingency table as follows:

                    Disorder (N)
                Present     Absent 
                ---------   ------
Mutation (A)        25      60 
No Mutation (A)     30      35

Perform a categorical analysis on this data using R and determine whether there is a significant association between the genetic mutation (A) and the occurrence of the disorder (N) at a significance level of 0.05. Clearly state your steps of hypothesis testing to get full marks.

# Create the 2x2 contingency table
mat <- matrix(c(25, 60,
                30, 35),
              nrow = 2, byrow = TRUE)

rownames(mat) <- c("Mutation_A", "No_Mutation")
colnames(mat) <- c("Present", "Absent")

mat

##             Present Absent
## Mutation_A       25     60
## No_Mutation      30     35

# Perform Pearson chi-square test 
chi_res <- chisq.test(mat, correct = FALSE)
chi_res

## 
##  Pearson's Chi-squared test
## 
## data:  mat
## X-squared = 4.4459, df = 1, p-value = 0.03499

# Perform Fisher's exact test
fisher_res <- fisher.test(mat)
fisher_res

## 
##  Fisher's Exact Test for Count Data
## 
## data:  mat
## p-value = 0.04108
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.234163 1.008181
## sample estimates:
## odds ratio 
##  0.4885218

Hypothesis Testing (Categorical Association)

Null hypothesis (H₀): There is no association between genetic mutation A and the presence of the neurological disorder N.
The variables are independent.

Alternative hypothesis (H₁):
There is an association between mutation A and disorder N.
The variables are not independent.

Using Pearson’s chi-square test without Yates correction, we obtain:

Chi-square statistic: approximately 4.446
Degrees of freedom: 1
p-value: approximately 0.03499

Using Fisher’s exact test, we obtain:

Two-sided p-value: approximately 0.04108
Odds ratio (conditional maximum likelihood estimate): about 0.489

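Interpretation: Both the Pearson chi-square test and Fisher’s exact test return p-values below 0.05.
Therefore, we reject the null hypothesis at the 5% significance level.
There is statistically significant evidence of an association between mutation A and the neurological disorder N.
The odds ratio is below 1: in this sample the disorder was present in 25 of 85 mutation carriers (about 29%) compared with 30 of 65 non-carriers (about 46%).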
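To make the chi-square result easier to verify, the statistic can be rebuilt by hand from the observed and expected counts. The sketch below uses only base R and the table given in the question; the names `expected`, `chi_manual`, `p_manual`, and `or_sample` are illustrative and not part of the original answer.

```r
# Observed counts from the question
mat <- matrix(c(25, 60,
                30, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Mutation_A", "No_Mutation"),
                              c("Present", "Absent")))

# Expected counts under independence: (row total * column total) / N
expected <- outer(rowSums(mat), colSums(mat)) / sum(mat)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
chi_manual <- sum((mat - expected)^2 / expected)

# p-value from the chi-square distribution with (2 - 1) * (2 - 1) = 1 df
p_manual <- pchisq(chi_manual, df = 1, lower.tail = FALSE)

# Sample (cross-product) odds ratio: (25 * 35) / (60 * 30)
or_sample <- (mat[1, 1] * mat[2, 2]) / (mat[1, 2] * mat[2, 1])

expected
chi_manual
p_manual
or_sample
```

All four expected counts (about 31.17, 53.83, 23.83, and 41.17) exceed 5, which supports using the chi-square approximation here. The sample odds ratio, 875/1800 ≈ 0.486, differs slightly from the 0.4885 printed by `fisher.test()` because that function reports a conditional maximum likelihood estimate rather than the cross-product ratio.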