Introduction
In this lab, I will summarize, analyze and interpret some data on the individuals who participated in the study described in the following paper:
Almotwaa, S., Elrobh, M., AbdulKarim, H., Alanazi, M., Aldaihan, S., Shaik, J., Arafa, M. and Warsy, A.S., 2018. Genetic polymorphism and expression of HSF1 gene is significantly associated with breast cancer in Saudi females. PLoS One, 13(3), p.e0193095.
This assignment is from the University of Toronto.
The dataset is also available for download.
Importing Data
The following code reads in the raw data and stores it in an R dataset called Lab3data, so that the dataset is ready to work with.
# Load packages: readr (read_csv), dplyr (filter, %>%)
library(tidyverse)
# Read in & glimpse data set
Lab3data <- read_csv("Almotwaa_2018_dataset.csv", show_col_types = FALSE)
# glimpse(Lab3data)
head(Lab3data)
“Breast cancer is the most frequently encountered amongst females and is a leading cause of cancer-related deaths all over the world and in Saudi Arabia” (Almotwaa et al., 2018). Heat Shock Factor 1 (HSF1) is a protein that enhances the survival, spread and proliferation of malignant cells, with higher levels of HSF1 being linked to increased cancer-related mortality. Almotwaa et al. hypothesized that single nucleotide polymorphisms (SNPs) in the HSF1 gene might affect its expression or function, which might in turn affect the development of breast cancer.
Since very few studies are reported in literature on the association between polymorphisms in HSF1 and risk of breast cancer, several SNPs located on the HSF1 gene were studied including rs78202224 (G>T), which is a non-synonymous mutation, located on exon 9.
a. What type of study did Almotwaa et al. conduct? How might the study design impact results? Briefly describe this potential impact(s) in the context of the study.
Almotwaa et al. conducted an observational study. This type of study design involves observing and analyzing existing data without any intervention or manipulation by the researchers. Here are some potential impacts of this study design on the results:
Causality Inference Limitation: Observational studies cannot establish causality. While Almotwaa et al. found an association between HSF1 gene polymorphism and breast cancer, they cannot definitively conclude that the gene directly causes breast cancer. Other factors may be involved.
Confounding Variables: Observational studies are susceptible to confounding variables—factors that may influence both the exposure (HSF1 gene polymorphism) and the outcome (breast cancer). These confounders can distort the observed association. Almotwaa et al. likely adjusted for known confounders, but unmeasured or unknown confounders could still impact the results.
Selection Bias: The choice of participants in observational studies can introduce bias. Almotwaa et al. focused on Saudi females with breast cancer, which may not represent the broader population. If the sample is not representative, the results may not generalize well.
Retrospective Nature: Almotwaa et al. likely analyzed existing data retrospectively. This means they relied on historical information, medical records, or genetic data. Retrospective studies may suffer from incomplete or inaccurate data, affecting the validity of findings.
Gene-Environment Interaction: Observational studies often lack detailed information on environmental factors. The impact of gene-environment interactions (such as lifestyle, diet, or exposure to carcinogens) on breast cancer risk may not be fully explored.
b. Since T SNPs are associated with increased risk of breast cancer, the authors were interested in the association between combined rs78202224 TT and GT genotypes compared to GG genotypes with breast cancer. In Table 3 (See TT+GT row, p 6, Almotwaa et al., 2018), the authors report a p-value of 0.035 and a test statistic of 4.44 for the test they conducted on this association. What statistical procedure did they use? Reproduce their results.
A chi-squared test of independence was used.
Note the measurements in rs78202224. First we need to combine the observations with GT and TT genotypes before performing the analysis.
Lab3data$combined <- ifelse(Lab3data$rs78202224 == "GG", "GG", "GT_TT")
table(Lab3data$combined)
##
## GG GT_TT
## 230 11
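# chisq.test() on the one-way table above is a goodness-of-fit test of equal
# GG and GT_TT frequencies; it returned X-squared = 199.01, df = 1,
# p-value < 2.2e-16, which does not match the paper's 4.44 (p = 0.035).
# The association with breast cancer needs the two-way table of group by genotype.
# correct = FALSE (no continuity correction) is an assumption about the authors' method.
chisq.test(table(Lab3data$group, Lab3data$combined), correct = FALSE)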
c. State the null and alternative hypotheses for the hypothesis test that the authors conducted. Be sure to describe any notation you use in the context of the study.
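One way to write the hypotheses for a chi-squared test of independence on a 2x2 table is sketched below. The notation is an assumption introduced here, not taken from the paper: \(p_{BC}\) is the proportion of the combined GT/TT genotype among Saudi females with breast cancer, and \(p_{C}\) is the same proportion in the comparison group (assumed to be the other level of the group variable).

\[
H_0: p_{BC} = p_{C} \quad \text{(genotype is independent of breast cancer status)}
\]
\[
H_a: p_{BC} \neq p_{C} \quad \text{(genotype is associated with breast cancer status)}
\]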
d. Assess all conditions of the procedure from question 2b-c and comment on the appropriateness of this procedure for these data. If it is an appropriate procedure, interpret the test results using a 5% significance level.
This part assesses the conditions and interprets the results of the chi-squared test of independence, in the context of comparing the proportions of GT/TT genotypes for PR- and PR+ breast cancer patients.
The chi-squared test of independence assesses whether there is an association between two categorical variables. It requires independent observations and an expected count of at least 5 in every cell of the table. Here, we want to determine whether the proportions of GT/TT genotypes are different for PR- and PR+ breast cancer patients.
Let’s proceed with the chi-squared test using the lab data to assess whether the proportions of GT/TT genotypes differ for PR- and PR+ breast cancer patients.
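The expected-count condition for a chi-squared test can be checked directly from the expected counts that chisq.test() returns. The following is a minimal sketch, not part of the original analysis; expected_counts is a hypothetical helper name, and the column names in the usage comment are those used elsewhere in this report.

```r
# Hypothetical helper: expected cell counts under independence for two
# categorical vectors. Warnings are suppressed because chisq.test() itself
# warns when expected counts are small, which is what we want to inspect.
expected_counts <- function(x, y) {
  tab <- table(x, y)
  suppressWarnings(chisq.test(tab, correct = FALSE))$expected
}

# Usage with the lab data (assumes the `group` and `combined` columns exist):
# expected_counts(Lab3data$group, Lab3data$combined)
```

If any expected count is below 5, the chi-squared approximation is doubtful and Fisher's exact test (fisher.test() in base R) is a common alternative.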
Investigating the SNP rs78202224 and its association with the status of hormone receptors, including progesterone receptor (PR) status, could also be important for guiding breast cancer treatment outcomes.
a.
(i) Choose the most appropriate graphical summary (or summaries) to describe the association between ProgesteroneReceptor_Status and rs78202224 genotypes (i.e., GG or combined GT/TT) in the breast cancer group, and justify your choice(s).
Appropriate Graphical Summary:
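A stacked bar chart of proportions is the most suitable choice for visualizing the association between these variables: both are categorical, and stacking the genotype proportions within each PR status group makes the genotype distributions directly comparable even when the groups differ in size.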
(ii) Produce your summary (or summaries) and interpret it in the context of this study.
Note the measurements. You will first need to filter the observations to include only those in the breastcancer group to create your summary (or summaries).
# Filter observations for the breast cancer group
breast_cancer_data <- Lab3data %>%
  filter(group == "breastcancer")
# Contingency table: combined genotype (rows) by PR status (columns);
# table() drops missing (NA) values by default
cont_table <- table(breast_cancer_data$combined,
                    breast_cancer_data$ProgesteroneReceptor_Status)
# Proportion of each genotype within each PR status group (columns sum to 1)
prop_table <- prop.table(cont_table, margin = 2)
# Create a stacked bar chart (one bar per PR status)
barplot(
  prop_table,
  col = c("lightblue", "lightgreen"),
  main = "Association between PR Status and rs78202224 Genotypes",
  xlab = "PR Status",
  ylab = "Proportion",
  legend.text = rownames(prop_table)
)
# Interpretation:
# If the segments within each bar differ noticeably, it suggests an association.
# Compare the proportions of GT/TT genotypes relative to GG genotypes for both PR- and PR+ groups.
By examining the stacked bar chart, we can visually assess whether there is any association between progesterone receptor status and rs78202224 genotypes in the breast cancer group. For example, we can observe whether the GT/TT genotypes are more prevalent in one PR status category than in the other. This analysis helps in understanding potential relationships between genetic variation (rs78202224) and hormone receptor status (PR) in breast cancer patients.
b. Suppose they wish to estimate the difference in the proportions of GT/TT genotypes for PR- and PR+ breast cancer patients. Is a large sample approximate Z CI for \(p_1-p_2\) appropriate for these data? Assess all conditions of this procedure. If this is not an appropriate procedure, propose an alternative procedure.
Note: Include only the individuals with reported status/genotype.
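To assess the difference in proportions of GT/TT genotypes between PR- and PR+ breast cancer patients, we can consider constructing a confidence interval (CI) for the difference in proportions (\(p_1 - p_2\)). However, before proceeding with a large sample approximate Z CI, let’s assess the conditions for its appropriateness: (1) the two groups must be independent random samples, and (2) the success-failure condition must hold, which is commonly taken to mean at least 10 observed GT/TT genotypes and at least 10 observed GG genotypes in each PR status group.
If the success-failure condition is not met, the large sample approximate Z CI is not appropriate. The exact binomial (Clopper-Pearson) interval applies to a single proportion, so it can describe each PR status group separately; for the difference \(p_1 - p_2\), a bootstrap confidence interval is one alternative.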
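The success-failure condition can be checked by counting the GT/TT and GG genotypes within each PR status group. The following is a minimal sketch under the assumption that genotypes are coded "GG" and "GT_TT" as in the combined variable; success_failure is a hypothetical helper name.

```r
# Hypothetical helper: number of "successes" (GT/TT) and "failures" (GG)
# in one group, ignoring missing genotypes.
success_failure <- function(genotype, success = "GT_TT") {
  genotype <- genotype[!is.na(genotype)]
  c(successes = sum(genotype == success),
    failures  = sum(genotype != success))
}

# Usage with the lab data (assumes the columns used elsewhere in this report):
# bc <- subset(Lab3data, group == "breastcancer")
# success_failure(bc$combined[bc$ProgesteroneReceptor_Status == "PR+"])
# success_failure(bc$combined[bc$ProgesteroneReceptor_Status == "PR-"])
```

If either count in either group falls below the threshold used in the course (commonly 10), the normal approximation behind the Z CI is questionable.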
(i) Using a 95% confidence level, compute an appropriate interval estimate using the procedure in Question 2b based on these data (i.e., either using the large sample approximate Z CI, or if not appropriate, the alternative procedure you proposed).
(ii) Interpret your interval estimate.
# Assuming you've already loaded the "Lab3data" dataset
# Filter observations for PR+ and PR- groups
PR_plus_data <- subset(Lab3data, ProgesteroneReceptor_Status == "PR+")
PR_minus_data <- subset(Lab3data, ProgesteroneReceptor_Status == "PR-")
# Calculate proportions
prop_PR_plus <- sum(PR_plus_data$combined == "GT_TT") / nrow(PR_plus_data)
prop_PR_minus <- sum(PR_minus_data$combined == "GT_TT") / nrow(PR_minus_data)
# Sample sizes
n_PR_plus <- nrow(PR_plus_data)
n_PR_minus <- nrow(PR_minus_data)
# Calculate standard error
SE <- sqrt((prop_PR_plus*(1-prop_PR_plus)/n_PR_plus) + (prop_PR_minus*(1-prop_PR_minus)/n_PR_minus))
# Critical value (for 95% CI)
Z <- qnorm(0.975)
# Calculate confidence interval for the difference (PR+ minus PR-)
diff_prop <- prop_PR_plus - prop_PR_minus
CI_lower <- diff_prop - Z * SE
CI_upper <- diff_prop + Z * SE
# Display results
cat("Proportion (PR+):", round(prop_PR_plus, 4), "\n")
## Proportion (PR+): 0.0568
cat("Proportion (PR-):", round(prop_PR_minus, 4), "\n")
## Proportion (PR-): 0.0862
cat("95% Confidence Interval:", round(CI_lower, 4), "to", round(CI_upper, 4), "\n")
## 95% Confidence Interval: -0.1163 to 0.0575
The 95% confidence interval for the difference in proportions contains 0. In other words, we do not have strong evidence that the proportions of GT/TT genotypes differ between the two PR status groups.