Inference on Proportions using Mathematical Models

Introduction

In this lab, I will summarize, analyze and interpret some data on the individuals who participated in the study described in the following paper:

Almotwaa, S., Elrobh, M., AbdulKarim, H., Alanazi, M., Aldaihan, S., Shaik, J., Arafa, M. and Warsy, A.S., 2018. Genetic polymorphism and expression of HSF1 gene is significantly associated with breast cancer in Saudi females. PLoS One, 13(3), p.e0193095.

This assignment is from the University of Toronto.


Importing Data

The following code reads in the raw data and stores it in an R data frame called Lab3data, so that the dataset is ready to work with.

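# Load the tidyverse (provides read_csv, the %>% pipe and filter)
library(tidyverse)

# Read in the data set and preview the first rows
Lab3data <- read_csv("Almotwaa_2018_dataset.csv", show_col_types = FALSE)

# glimpse(Lab3data)
head(Lab3data)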

“Breast cancer is the most frequently encountered amongst females and is a leading cause of cancer-related deaths all over the world and in Saudi Arabia” (Almotwaa et al., 2018). Heat Shock Factor 1 (HSF1) is a protein that enhances the survival, spread and proliferation of malignant cells, with higher levels of HSF1 being linked to increased cancer-related mortality. Almotwaa et al. hypothesized that single nucleotide polymorphisms (SNPs) in the HSF1 gene might affect its expression or function, which in turn might affect the development of breast cancer.

Section A

Since very few studies are reported in literature on the association between polymorphisms in HSF1 and risk of breast cancer, several SNPs located on the HSF1 gene were studied including rs78202224 (G>T), which is a non-synonymous mutation, located on exon 9.

a. What type of study did Almotwaa et al. conduct? How might the study design impact results? Briefly describe the potential impact(s) in the context of the study.

Almotwaa et al. conducted an observational study. In this type of design, the researchers record genotypes and disease status as they occur, without any intervention or manipulation. Here are some potential impacts of this study design on the results:

Causality Inference Limitation: Observational studies cannot establish causality. While Almotwaa et al. found an association between HSF1 gene polymorphism and breast cancer, they cannot definitively conclude that the gene directly causes breast cancer. Other factors may be involved.

Confounding Variables: Observational studies are susceptible to confounding variables—factors that may influence both the exposure (HSF1 gene polymorphism) and the outcome (breast cancer). These confounders can distort the observed association. Almotwaa et al. likely adjusted for known confounders, but unmeasured or unknown confounders could still impact the results.

Selection Bias: The choice of participants in observational studies can introduce bias. Almotwaa et al. focused on Saudi females with breast cancer, which may not represent the broader population. If the sample is not representative, the results may not generalize well.

Retrospective Nature: Almotwaa et al. likely analyzed existing data retrospectively. This means they relied on historical information, medical records, or genetic data. Retrospective studies may suffer from incomplete or inaccurate data, affecting the validity of findings.

Gene-Environment Interaction: Observational studies often lack detailed information on environmental factors. The impact of gene-environment interactions (such as lifestyle, diet, or exposure to carcinogens) on breast cancer risk may not be fully explored.

b. Since T SNPs are associated with increased risk of breast cancer, the authors were interested in the association between combined rs78202224 TT and GT genotypes compared to GG genotypes with breast cancer. In Table 3 (See TT+GT row, p 6, Almotwaa et al., 2018), the authors report a p-value of 0.035 and a test statistic of 4.44 for the test they conducted on this association. What statistical procedure did they use? Reproduce their results.

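A chi-squared test of independence was used, applied to the two-way table of genotype (GG vs. combined GT/TT) by study group.

Note the values recorded in rs78202224. First, we need to combine the observations with GT and TT genotypes before performing the analysis.

Lab3data$combined <- ifelse(Lab3data$rs78202224 == "GG", "GG", "GT_TT")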
table(Lab3data$combined)

## 
##    GG GT_TT 
##   230    11

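# Cross-tabulate study group against combined genotype, then test for association.
# correct = FALSE turns off Yates' continuity correction (an assumption about
# how the authors computed their statistic).
group_by_genotype <- table(Lab3data$group, Lab3data$combined)
chisq.test(group_by_genotype, correct = FALSE)

The resulting statistic and p-value should be compared with the values reported in Table 3 (test statistic 4.44, p-value 0.035).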
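
For reference, the statistic behind a chi-squared test of independence compares the observed cell counts of the two-way table with the counts expected if the two variables were unrelated. The symbols \(O_{ij}\), \(E_{ij}\), \(r\) and \(c\) are notation introduced here for illustration; they are not taken from the paper.

```latex
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
\text{df} = (r - 1)(c - 1)
\]
```

For a table of two study groups by two genotype categories, \(r = c = 2\), so the statistic is compared with a chi-squared distribution with 1 degree of freedom.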

c. State the null and alternative hypotheses for the hypothesis test that the authors conducted. Be sure to describe any notation you use in the context of the study.

Null Hypothesis (\(H_0\)):
- There is no association between genotype at rs78202224 (GG vs. combined GT+TT) and breast cancer status.
- In notation:
  - \(H_0: p_{\text{GG}} = p_{\text{GT+TT}}\), where \(p_{\text{GG}}\) is the proportion of women with breast cancer among those with the GG genotype and \(p_{\text{GT+TT}}\) is the corresponding proportion among those with a GT or TT genotype.

Alternative Hypothesis (\(H_a\)):
- There is an association between genotype at rs78202224 (GG vs. combined GT+TT) and breast cancer status.
- In notation:
  - \(H_a: p_{\text{GG}} \neq p_{\text{GT+TT}}\).

d. Assess all conditions of the procedure from question 2b-c and comment on the appropriateness of this procedure for these data. If it is an appropriate procedure, interpret the test results using a 5% significance level.

Conditions and interpretation of the results for the chi-squared test of independence between genotype (GG vs. GT/TT) and breast cancer status.

Chi-Squared Test of Independence

The chi-squared test of independence assesses whether there is an association between two categorical variables. Here, we want to determine whether the proportion of GT/TT genotypes differs between women with and without breast cancer.

Conditions for the Chi-Squared Test of Independence:

Independence Assumption:
- The observations must be independent. This means that the data should not come from a paired or matched design.
- In this study, each participant contributes one genotype and one disease status, so independence is reasonable provided the participants are unrelated and were not matched.
Sample Size:
- Each cell in the contingency table (cross-tabulation) should have an expected frequency of at least 5.
- Only 11 of the 241 participants carry a GT or TT genotype, so the expected counts in the GT/TT cells are small and this condition must be checked explicitly.
Random Sampling:
- The data should be collected using random sampling or a well-defined sampling process.
- If participants were recruited by convenience rather than sampled at random, the results may not generalize beyond the study sample.

Interpretation of Results:

We use a 5% significance level (α = 0.05).
If the p-value is less than 0.05, we reject the null hypothesis, indicating evidence of an association between genotype and breast cancer status.
If the p-value is greater than or equal to 0.05, we fail to reject the null hypothesis: the data do not provide sufficient evidence of an association.

The authors report a p-value of 0.035, which is below 0.05. Provided the conditions above hold, we reject the null hypothesis and conclude that there is evidence of an association between carrying a GT or TT genotype at rs78202224 and breast cancer in this study. Because the study is observational, this association does not establish that the genotype causes breast cancer.
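
The expected-count condition can be checked directly from the object returned by chisq.test(). The following is a minimal sketch, not the authors' analysis: the four cell counts and the group label "comparison" are assumptions, chosen to be consistent with the margins printed earlier (230 GG and 11 GT/TT). With the real data, the table would come from table(Lab3data$group, Lab3data$combined).

```r
# ASSUMPTION: illustrative cell counts that match the printed margins
# (230 GG, 11 GT/TT). The group label "comparison" is hypothetical.
assumed_counts <- matrix(
  c(94, 136,   # GG:    comparison group, breast cancer group
     1,  10),  # GT_TT: comparison group, breast cancer group
  nrow = 2,
  dimnames = list(group = c("comparison", "breastcancer"),
                  genotype = c("GG", "GT_TT"))
)

# Pearson chi-squared test without Yates' continuity correction.
# suppressWarnings() hides R's small-expected-count warning so that the
# condition can be inspected by hand below.
test_result <- suppressWarnings(chisq.test(assumed_counts, correct = FALSE))

# Expected counts under independence: row total * column total / grand total
expected_counts <- test_result$expected
print(round(expected_counts, 2))

# Condition check: are all expected counts at least 5?
all_at_least_5 <- all(expected_counts >= 5)
print(all_at_least_5)
```

With these assumed counts, the uncorrected statistic is about 4.44, but the smallest expected count falls below 5. If the real table behaves the same way, the chi-squared approximation is questionable and the reported p-value should be interpreted cautiously.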

Section B

Investigating the SNP rs78202224 and its association with hormone receptor status, including progesterone receptor (PR) status, could also be important for guiding breast cancer treatment.

(i) Choose the most appropriate graphical summary (or summaries) to describe the association between ProgesteroneReceptor_Status and rs78202224 genotypes (i.e., GG or combined GT/TT) in the breast cancer group, and justify your choice(s).

Appropriate Graphical Summary:

A stacked bar chart would be the most suitable choice for visualizing the association between these categorical variables. Here’s why:

Stacked Bar Chart:
- A stacked bar chart allows us to display the distribution of genotypes (GG and GT/TT) within each PR status category (PR- and PR+).
- The x-axis represents PR status (two bars: PR- and PR+), and the y-axis represents the proportion of each genotype.
- Each bar is divided into segments representing GG and GT/TT genotypes.
- By stacking the segments, we can compare the proportions of genotypes within each PR status group.
Justification:
- The stacked bar chart shows the relationship between two categorical variables on a common scale: because each bar is scaled to 100% of its PR status group, groups of different sizes can be compared directly.
- It allows us to visually assess whether there are differences in genotype proportions between PR- and PR+ breast cancer patients.
Interpretation:
- If the GT/TT segment takes up a noticeably different share of the PR- bar than of the PR+ bar, it suggests an association between PR status and rs78202224 genotypes.
- We can easily compare the proportions of GT/TT genotypes relative to GG genotypes for both PR- and PR+ groups.

(ii) Produce your summary (or summaries) and interpret it in the context of this study.

Note the measurements. You will first need to filter the observations to include only those in the breastcancer group to create your summary (or summaries).

# Keep only the breast cancer group, restricted to patients with a reported PR status
breast_cancer_data <- Lab3data %>%
  filter(group == "breastcancer",
         ProgesteroneReceptor_Status %in% c("PR+", "PR-"))

# Contingency table: combined rs78202224 genotype (rows) by PR status (columns)
cont_table <- table(breast_cancer_data$combined,
                    breast_cancer_data$ProgesteroneReceptor_Status)

# Proportion of each genotype within each PR status group (each column sums to 1)
prop_table <- prop.table(cont_table, margin = 2)

# Stacked bar chart: one bar per PR status, segments are the genotypes
barplot(
  prop_table,
  beside = FALSE,
  col = c("lightblue", "lightgreen"),
  main = "Association between PR Status and rs78202224 Genotypes",
  xlab = "PR Status",
  ylab = "Proportion",
  legend.text = rownames(prop_table)
)

# Interpretation:
# If the GT/TT share differs noticeably between the PR- and PR+ bars, it suggests an association.
# Compare the proportions of GT/TT genotypes relative to GG genotypes for both PR- and PR+ groups.

By examining the stacked bar chart, we can visually assess whether there is any association between progesterone receptor status and rs78202224 genotypes in the breast cancer group. GT/TT genotypes are rare in both groups: roughly 5.7% of PR+ patients and 8.6% of PR- patients (see the proportions reported in part b below), so the two bars look very similar and the chart gives little visual evidence of an association between this genetic variation and PR status.

b. Suppose they wish to estimate the difference in the proportions of GT/TT genotypes for PR- and PR+ breast cancer patients. Is a large sample approximate Z CI for \(p_1-p_2\) appropriate for these data? Assess all conditions of this procedure. If this is not an appropriate procedure, propose an alternative procedure.

Note: Include only the individuals with reported status/genotype.

To assess the difference in proportions of GT/TT genotypes between PR- and PR+ breast cancer patients, we can consider constructing a confidence interval (CI) for the difference in proportions (\(p_1 - p_2\)). However, before proceeding with a large sample approximate Z CI, let’s assess the conditions for its appropriateness:

Conditions for Large Sample Approximate Z CI:

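Independence Assumption:
- The observations should be independent within each group (PR- and PR+), and the two groups should be independent of each other.
- Each patient appears once and belongs to exactly one PR status group, so this is reasonable provided the patients are unrelated.
Sample Size:
- The sample size should be sufficiently large for the normal approximation to be valid.
- This is assessed through the success-failure condition below.
Success-Failure Condition:
- The number of successes (GT/TT) and failures (GG) in each group should be at least 10.
- Only 11 individuals in the whole dataset carry a GT or TT genotype, so the PR- and PR+ groups cannot both contain at least 10 GT/TT carriers. The condition is not met, and the large sample approximate Z CI is not appropriate for these data.

Alternative Procedure:

Because the success-failure condition is not met, a procedure that does not rely on the normal approximation is preferable. The exact binomial (Clopper-Pearson) interval can be used for the GT/TT proportion in each group separately; for the difference \(p_1 - p_2\) itself, a bootstrap percentile interval is one option.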
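
The success-failure check and the exact intervals can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the lab's official solution: the counts (5 of 88 for PR+, 5 of 58 for PR-) are assumed values chosen to reproduce the proportions printed in this lab (0.0568 and 0.0862), and the variable names are hypothetical.

```r
# ASSUMPTION: GT/TT carrier counts and group sizes chosen to reproduce the
# printed proportions (0.0568 for PR+, 0.0862 for PR-). With the real data,
# take them from table(ProgesteroneReceptor_Status, combined).
gt_tt_carriers <- c("PR+" = 5, "PR-" = 5)
group_sizes    <- c("PR+" = 88, "PR-" = 58)

# Success-failure condition: at least 10 successes AND 10 failures per group
successes <- gt_tt_carriers
failures  <- group_sizes - gt_tt_carriers
condition_met <- all(successes >= 10) && all(failures >= 10)
print(condition_met)

# Exact (Clopper-Pearson) 95% interval for each group's GT/TT proportion.
# Note: these are single-proportion intervals, not an interval for p1 - p2.
exact_intervals <- sapply(names(group_sizes), function(g) {
  binom.test(gt_tt_carriers[[g]], group_sizes[[g]], conf.level = 0.95)$conf.int
})
rownames(exact_intervals) <- c("lower", "upper")
print(round(exact_intervals, 4))
```

binom.test() is used here because its conf.int component is the Clopper-Pearson interval, which matches the alternative named above.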

(i) Using a 95% confidence level, compute an appropriate interval estimate using the procedure in Question 2b based on these data (i.e., either using the large sample approximate Z CI, or if not appropriate, the alternative procedure you proposed).
(ii) Interpret your interval estimate.

# Filter observations for the PR+ and PR- groups
# (individuals without a reported PR status are dropped by subset())
PR_plus_data <- subset(Lab3data, ProgesteroneReceptor_Status == "PR+")
PR_minus_data <- subset(Lab3data, ProgesteroneReceptor_Status == "PR-")

# Calculate proportions
prop_PR_plus <- sum(PR_plus_data$combined == "GT_TT") / nrow(PR_plus_data)
prop_PR_minus <- sum(PR_minus_data$combined == "GT_TT") / nrow(PR_minus_data)

# Sample sizes
n_PR_plus <- nrow(PR_plus_data)
n_PR_minus <- nrow(PR_minus_data)

# Calculate standard error
SE <- sqrt((prop_PR_plus*(1-prop_PR_plus)/n_PR_plus) + (prop_PR_minus*(1-prop_PR_minus)/n_PR_minus))

# Critical value (for 95% CI)
Z <- qnorm(0.975)

# Calculate confidence interval, centred on the difference in sample proportions
diff_prop <- prop_PR_plus - prop_PR_minus
CI_lower <- diff_prop - Z * SE
CI_upper <- diff_prop + Z * SE

# Display results
cat("Proportion (PR+):", round(prop_PR_plus, 4), "\n")

## Proportion (PR+): 0.0568

cat("Proportion (PR-):", round(prop_PR_minus, 4), "\n")

## Proportion (PR-): 0.0862

cat("95% Confidence Interval:", round(CI_lower, 4), "to", round(CI_upper, 4), "\n")

## 95% Confidence Interval: -0.1163 to 0.0575

The estimated difference in proportions (\(p_1 - p_2\), with \(p_1\) the GT/TT proportion among PR+ patients and \(p_2\) the GT/TT proportion among PR- patients) is 0.0568 - 0.0862 = -0.0294.
We are 95% confident that the true difference in proportions lies between -0.1163 and 0.0575.
Since the interval includes zero, these data do not show a significant difference in GT/TT proportions between PR+ and PR- breast cancer patients.

In other words, we do not have strong evidence that the proportion of GT/TT genotypes differs between the two PR status groups. Because GT/TT carriers are so rare (11 in the whole dataset), the normal approximation behind this interval is doubtful and the result should be read with caution.
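
One way to estimate the difference without relying on the normal approximation is a bootstrap percentile interval. This is a swapped-in technique offered as a sketch, not the procedure used in the lab: the counts (5 of 88 for PR+, 5 of 58 for PR-) are assumptions chosen to reproduce the printed proportions, and the seed and number of resamples are arbitrary.

```r
# ASSUMPTION: group sizes and GT/TT counts chosen to reproduce the printed
# proportions (0.0568 for PR+, 0.0862 for PR-); replace with the real data.
pr_plus  <- c(rep(1, 5), rep(0, 83))   # 1 = GT/TT carrier, 0 = GG
pr_minus <- c(rep(1, 5), rep(0, 53))

set.seed(2018)        # arbitrary seed, for reproducibility only
n_resamples <- 5000   # hypothetical choice of bootstrap size

# Resample each group with replacement and recompute the difference each time
boot_diffs <- replicate(n_resamples, {
  mean(sample(pr_plus, replace = TRUE)) - mean(sample(pr_minus, replace = TRUE))
})

# Percentile interval: the middle 95% of the bootstrap differences
observed_diff <- mean(pr_plus) - mean(pr_minus)
boot_ci <- quantile(boot_diffs, probs = c(0.025, 0.975))
print(round(observed_diff, 4))
print(round(boot_ci, 4))
```

If the resulting interval also contains zero, the conclusion agrees with the one above: no clear evidence that the GT/TT proportion differs by PR status.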