Due date: 30/11/2025 (Sunday)
How to create the *.pdf file?
Step 1: Knit this *.rmd file as an html file
Step 2: Open the html file with your web browser
Step 3: From your web browser, save it as a *.pdf file.
The gene expression data collected by Golub et al. (1999) are among
the classic data sets in bioinformatics. The data are stored in
golub.txt, containing gene expression values of 3051 genes
(rows) from 38 leukemia patients (columns). Twenty-seven patients
(columns 1 to 27) are diagnosed with acute lymphoblastic leukemia (ALL) and
eleven (columns 28 to 38) with acute myeloid leukemia (AML). The tumor
class of ALL is 0 (negative), while the tumor class of AML is 1
(positive).
The important gene CD33 is one of the investigated genes. Its expression values are in row 808 of the golub data. Suppose that normality of the ALL and AML expression values has been validated, and assume equal variances. Test the equality of the mean CD33 expression in the two groups with an appropriate test. State the null hypothesis, report the p-value, and give your conclusion.
# Read the data file
golub <- read.table("golub.txt", header = TRUE, sep = "\t")
# Extract row 808 corresponding to gene CD33
cd33 <- as.numeric(golub[808, ])
# Split into ALL and AML
all_vals <- cd33[1:27]
aml_vals <- cd33[28:38]
# Perform two-sample t-test assuming equal variances
t_res <- t.test(all_vals, aml_vals, var.equal = TRUE)
# Display the test result
t_res
##
## Two Sample t-test
##
## data: all_vals and aml_vals
## t = -7.9813, df = 36, p-value = 1.773e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.5487898 -0.9211602
## sample estimates:
## mean of x mean of y
## -0.8812041 0.3537709
Null hypothesis (H₀):
The mean expression level of gene CD33 is the same between ALL and AML
patients.
\(H_0: \mu_{\text{ALL}} = \mu_{\text{AML}}\)
Alternative hypothesis (H₁): The mean expression levels differ.
\(H_1: \mu_{\text{ALL}} \ne \mu_{\text{AML}}\)
Using a two-sample t-test with equal variances, we obtained:
t-statistic: -7.981
df: 36
p-value: 1.77 × 10⁻⁹
mean(ALL): -0.8812
mean(AML): 0.3538
95% CI for the difference in means (ALL minus AML): [-1.549, -0.921]
Conclusion:
Because the p-value is far below 0.05, we reject the null
hypothesis.
CD33 expression differs significantly between ALL and AML patients, with a higher mean expression in the AML group (0.354 versus -0.881).
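As a cross-check on what `t.test(..., var.equal = TRUE)` computes, the pooled-variance statistic can be reproduced by hand. The sketch below is illustrative only: `pooled_t_test` is a hypothetical helper that is not part of the original analysis, and because golub.txt is not bundled here it runs on simulated data with the same group sizes (27 and 11), so its numbers will not match the CD33 results above.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# two-sample t-test with a pooled variance estimate.
pooled_t_test <- function(x, y) {
  n_x <- length(x)
  n_y <- length(y)
  df <- n_x + n_y - 2
  # Pooled variance: df-weighted average of the two sample variances
  s2_pooled <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / df
  # Standard error of the difference in means
  se <- sqrt(s2_pooled * (1 / n_x + 1 / n_y))
  t_stat <- (mean(x) - mean(y)) / se
  # Two-sided p-value from the t distribution with n_x + n_y - 2 df
  p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  list(statistic = t_stat, df = df, p.value = p_value)
}

# Simulated stand-in data (assumption): same group sizes as the ALL/AML columns
set.seed(1)
x_sim <- rnorm(27, mean = -0.9, sd = 0.5)
y_sim <- rnorm(11, mean = 0.35, sd = 0.5)

manual <- pooled_t_test(x_sim, y_sim)
builtin <- t.test(x_sim, y_sim, var.equal = TRUE)

# Both comparisons should return TRUE (agreement up to floating-point error)
all.equal(manual$statistic, unname(builtin$statistic))
all.equal(manual$p.value, builtin$p.value)
```

With the real data, calling `pooled_t_test(all_vals, aml_vals)` should reproduce the statistic, the 36 degrees of freedom (27 + 11 - 2), and the p-value reported by `t_res`.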
Researchers conducted a study to investigate the relationship between a specific genetic mutation (A) and susceptibility to a rare neurological disorder (N). They collected data from a cohort of 150 individuals and summarized it in a 2x2 contingency table as follows:
                     Disorder (N)
                   Present   Absent
                   -------   ------
Mutation (A)          25       60
No Mutation (A)       30       35
Perform a categorical analysis on this data using R and determine whether there is a significant association between the genetic mutation (A) and the occurrence of the disorder (N) at a significance level of 0.05. Clearly state your steps of hypothesis testing to get full marks.
# Create the 2x2 contingency table
mat <- matrix(c(25, 60,
                30, 35),
              nrow = 2, byrow = TRUE)
rownames(mat) <- c("Mutation_A", "No_Mutation")
colnames(mat) <- c("Present", "Absent")
mat
## Present Absent
## Mutation_A 25 60
## No_Mutation 30 35
# Perform Pearson chi-square test
chi_res <- chisq.test(mat, correct = FALSE)
chi_res
##
## Pearson's Chi-squared test
##
## data: mat
## X-squared = 4.4459, df = 1, p-value = 0.03499
# Perform Fisher's exact test
fisher_res <- fisher.test(mat)
fisher_res
##
## Fisher's Exact Test for Count Data
##
## data: mat
## p-value = 0.04108
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.234163 1.008181
## sample estimates:
## odds ratio
## 0.4885218
Hypothesis Testing (Categorical Association)
Null hypothesis (H₀): There is no association between genetic
mutation A and the presence of the neurological disorder N.
The variables are independent.
Alternative hypothesis (H₁):
There is an association between mutation A and disorder N.
The variables are not independent.
Using Pearson’s chi-square test without Yates correction, we obtain:
Chi-square statistic: approximately 4.446
Degrees of freedom: 1
p-value: approximately 0.03499
Using Fisher’s exact test, we obtain:
Two-sided p-value: approximately 0.04108
Odds ratio (conditional maximum-likelihood estimate): about 0.489
95% CI for the odds ratio: [0.234, 1.008]
Interpretation: Both the Pearson chi-square test and Fisher’s exact
test return p-values below 0.05.
Therefore, we reject the null hypothesis at the 5% significance
level.
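There is statistically significant evidence of an association between
mutation A and the neurological disorder N.
The estimated odds ratio is below 1: in this sample the disorder was less frequent among mutation carriers (25 of 85, about 29%) than among non-carriers (30 of 65, about 46%).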
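To make the chi-square result easier to verify, the statistic can be recomputed from the expected counts under independence. The following is a minimal sketch using the same 2x2 table; the variable names (`expected`, `min_expected`, `chi_sq`, `or_sample`) are illustrative and do not appear in the original code.

```r
# Same 2x2 table as in the analysis above
mat <- matrix(c(25, 60,
                30, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Mutation_A", "No_Mutation"),
                              c("Present", "Absent")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(mat), colSums(mat)) / sum(mat)

# Common rule of thumb: the chi-square approximation is considered
# adequate when every expected count is at least 5
min_expected <- min(expected)

# Pearson statistic without continuity correction
chi_sq <- sum((mat - expected)^2 / expected)
df <- (nrow(mat) - 1) * (ncol(mat) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Proportion with the disorder in each mutation group
prop_present <- mat[, "Present"] / rowSums(mat)

# Sample (cross-product) odds ratio
or_sample <- (mat[1, 1] * mat[2, 2]) / (mat[1, 2] * mat[2, 1])

round(c(chi_sq = chi_sq, p_value = p_value, or_sample = or_sample), 4)
round(prop_present, 3)
```

The recomputed statistic and p-value should agree with the `chisq.test(mat, correct = FALSE)` output (about 4.446 and 0.035). The sample odds ratio, 875/1800 or about 0.486, differs slightly from the 0.4885 reported by `fisher.test`, which returns the conditional maximum-likelihood estimate rather than the cross-product ratio.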