In this module, we explore the METABRIC breast cancer dataset, focusing on visualizing gene expression and clinical data to uncover potential patterns and relationships. We’ll demonstrate several data visualization techniques, including scatter plots, correlation matrices, and survival analysis, to build an understanding of the data. The goal is to use visual analytics to generate hypotheses for further research and to identify key prognostic indicators in breast cancer.
The METABRIC dataset, derived from a large-scale study of breast cancer patients, offers a rich source of data for modeling survival and other clinical outcomes. This module will guide you through several exploratory steps to make the most of this dataset.
Start by loading the necessary data files, including gene expression matrices and clinical information. These files are stored in .RData format, which is efficient for handling large datasets in R.
Load the data files and examine their contents.
load(file.path(data_dir, "metabric_disc_expr_df.RData"))
load(file.path(data_dir, "metabric_disc_clin_df.RData"))
load(file.path(data_dir, "metabric_expr_gene_info.RData"))
load(file.path(data_dir, "metabric_disc_expr_mat.RData"))
ls()
## [1] "data_dir" "gene_df" "metabric_disc_clin_df"
## [4] "metabric_disc_expr_df" "metabric_disc_expr_mat"
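Because `load()` restores objects under the names they were saved with, it can overwrite existing variables without warning. The following is a minimal sketch of loading a file into a separate environment so its contents can be inspected first. The toy data frame and temporary file are illustrative stand-ins, not the METABRIC files.

```r
# Illustrative stand-in for one of the .RData files (not the real data)
toy_df <- data.frame(id = c("MB_0001", "MB_0002"), value = c(1.5, 2.5))
tmp_file <- tempfile(fileext = ".RData")
save(toy_df, file = tmp_file)
rm(toy_df)

# Load into a fresh environment instead of the global workspace
tmp_env <- new.env()
loaded_names <- load(tmp_file, envir = tmp_env)

loaded_names                          # names of the restored objects
dim(get("toy_df", envir = tmp_env))   # inspect before copying into the workspace
```

The character vector returned by `load()` gives the names of the restored objects, which is a quick way to confirm what a file contains.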
Check the dimensions of the clinical data frame and the expression data frame to understand their size.
dim(metabric_disc_clin_df)
## [1] 997 32
dim(metabric_disc_expr_df)
## [1] 15834 999
Verify that the clinical data and the expression matrix are aligned by patient ID. The first ten clinical IDs should map to the first ten matrix columns, in order.
match(metabric_disc_clin_df$metabric_id, colnames(metabric_disc_expr_mat))[1:10]
## [1] 1 2 3 4 5 6 7 8 9 10
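Inspecting the first ten positions is only a spot check. The sketch below, using toy identifiers rather than the METABRIC objects, shows one way to check every ID and reorder the matrix columns if needed. It assumes the IDs are character vectors; if the ID column is a factor, wrap it in `as.character()` first.

```r
# Toy stand-ins for the clinical IDs and the expression matrix columns
clin_ids <- c("MB_0001", "MB_0002", "MB_0003")
expr_mat <- matrix(1:6, nrow = 2,
                   dimnames = list(c("GENE_A", "GENE_B"),
                                   c("MB_0003", "MB_0001", "MB_0002")))

# Check every ID, not just the first few
aligned <- identical(clin_ids, colnames(expr_mat))

# If the order differs, reorder the matrix columns to follow the clinical IDs
idx <- match(clin_ids, colnames(expr_mat))
stopifnot(!anyNA(idx))   # every patient must have an expression column
expr_mat <- expr_mat[, idx, drop = FALSE]

identical(clin_ids, colnames(expr_mat))
```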
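The expression data are also available as a matrix, with genes in rows and patients in columns, which is easier to manipulate than the data frame. Preview the first five rows and columns.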
metabric_disc_expr_mat[1:5, 1:5]
## MB_0362 MB_0346 MB_0386 MB_0574 MB_0185
## TSPAN6 7.579 7.819 7.366 8.110 5.993
## TNMD 6.211 5.557 8.108 5.789 5.484
## DPM1 9.150 9.794 9.229 9.672 10.159
## SCYL3 7.193 7.845 6.777 6.863 6.692
## C1orf112 7.835 8.084 6.895 6.987 7.173
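Next, extract the expression values for AURKA and plot them against survival time.
# Locate the AURKA row and pull its expression values as a plain vector
aurka_row <- which(rownames(metabric_disc_expr_mat) == "AURKA")
aurka <- as.vector(metabric_disc_expr_mat[aurka_row, ])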
# Scatter plot with linear model trend line
plot(metabric_disc_clin_df$T ~ aurka, xlab = "AURKA Expression", ylab = "Time to Death")
abline(lm(metabric_disc_clin_df$T ~ aurka), col = "blue")
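The trend line summarizes how survival time changes with AURKA expression. Higher AURKA expression is often associated with more aggressive tumor characteristics and poorer outcomes. Because this plot does not distinguish censored patients from observed deaths, treat it as a rough first look.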
Correlation Matrix
To explore correlations between multiple genes, we can create a correlation matrix. This helps identify clusters of highly correlated features, which might be biologically or clinically significant.
# Check available gene names
available_genes <- rownames(metabric_disc_expr_mat)
# Filter selected genes to only those available in the matrix
selected_genes <- c("AURKA", "BRCA1", "TP53", "ESR1", "HER2")
selected_genes <- selected_genes[selected_genes %in% available_genes]
# Subset expression matrix with available selected genes
expr_subset <- metabric_disc_expr_mat[selected_genes, ]
# Calculate correlation matrix
cor_matrix <- cor(t(expr_subset))
cor_matrix
## AURKA BRCA1 TP53 ESR1
## AURKA 1.00000000 0.33198665 0.02342025 -0.34356263
## BRCA1 0.33198665 1.00000000 -0.02931547 0.13781034
## TP53 0.02342025 -0.02931547 1.00000000 0.04989223
## ESR1 -0.34356263 0.13781034 0.04989223 1.00000000
# Visualize correlation matrix
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor_matrix, method = "circle", type = "upper", tl.col = "black")
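The matrix shows that AURKA and ESR1 are negatively correlated (r ≈ -0.34), while AURKA and BRCA1 are positively correlated (r ≈ 0.33). TP53 shows almost no linear correlation with the other three genes. HER2 does not appear in the output because that name was not found among the row names of the expression matrix; the official gene symbol for HER2 is ERBB2.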
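A correlation coefficient alone does not say how precisely it is estimated. As a sketch, `cor.test()` returns the coefficient together with a confidence interval and a p-value. The two vectors below are simulated values for hypothetical genes, not METABRIC data.

```r
set.seed(1)
# Simulated expression values for two hypothetical genes (not METABRIC data)
gene_a <- rnorm(100)
gene_b <- -0.5 * gene_a + rnorm(100)

ct <- cor.test(gene_a, gene_b, method = "pearson")
ct$estimate   # Pearson correlation coefficient
ct$conf.int   # 95% confidence interval
ct$p.value    # test of the null hypothesis of zero correlation
```

With the real data, the same call could be applied to any two rows of the expression matrix, for example the AURKA and ESR1 rows.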
Survival Analysis
To understand the prognosis of cancer patients, we can plot Kaplan-Meier survival curves to visualize the differences in survival rates based on gene expression levels.
library(survival)
# Binning AURKA expression into high and low groups
aurka_group <- ifelse(aurka > median(aurka), "High", "Low")
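# Create a survival object. Surv() expects event = 1 (or TRUE) for an observed death,
# so confirm how the `censored` column is coded before interpreting the curves.
surv_obj <- Surv(time = metabric_disc_clin_df$T, event = metabric_disc_clin_df$censored)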
# Kaplan-Meier survival curves
km_fit <- survfit(surv_obj ~ aurka_group, data = metabric_disc_clin_df)
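# Strata are ordered alphabetically, so "High" is drawn in red and "Low" in green
plot(km_fit, col = c("red", "green"), xlab = "Time (days)", ylab = "Survival Probability", main = "Kaplan-Meier Curves by AURKA Expression")
legend("topright", legend = c("High", "Low"), col = c("red", "green"), lty = 1, title = "AURKA")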
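Kaplan-Meier curves show a difference visually, and a log-rank test is a common way to ask whether that difference is larger than expected by chance. The sketch below uses `survdiff()` from the survival package on simulated data. The group labels, rates, and event coding (1 = death observed) are assumptions made for illustration.

```r
library(survival)

set.seed(42)
n <- 200
# Simulated data (not METABRIC): the "High" group is given a higher hazard
group <- rep(c("High", "Low"), each = n / 2)
time  <- rexp(n, rate = ifelse(group == "High", 0.002, 0.001))
event <- rbinom(n, size = 1, prob = 0.8)   # 1 = death observed, 0 = censored

# Log-rank test comparing the two survival curves
lr <- survdiff(Surv(time, event) ~ group)
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value
```

With the module's objects, the analogous call would be `survdiff(surv_obj ~ aurka_group)`, provided the event indicator is coded as `Surv()` expects.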
Exploring Additional Clinical Variables
We can extend our analysis by exploring additional clinical variables, such as the number of positive lymph nodes, which is a known prognostic indicator in breast cancer.
# Scatter plot of lymph nodes vs. time to death
plot(metabric_disc_clin_df$T ~ metabric_disc_clin_df$lymph_nodes_positive, xlab = "Positive Lymph Nodes", ylab = "Time to Death")
abline(lm(metabric_disc_clin_df$T ~ metabric_disc_clin_df$lymph_nodes_positive), col = "purple")
Multivariate Analysis
To understand the combined effects of multiple variables, we can build a multivariate model. This model includes gene expression and clinical data to predict survival outcomes.
# Multivariate linear model including AURKA and lymph nodes
multivar_model <- lm(metabric_disc_clin_df$T ~ aurka + metabric_disc_clin_df$lymph_nodes_positive)
summary(multivar_model)
##
## Call:
## lm(formula = metabric_disc_clin_df$T ~ aurka + metabric_disc_clin_df$lymph_nodes_positive)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3256.3 -1358.2 -123.3 1367.5 5353.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4126.65 489.51 8.430 < 2e-16
## aurka -134.78 63.51 -2.122 0.0341
## metabric_disc_clin_df$lymph_nodes_positive -120.12 14.79 -8.122 1.36e-15
##
## (Intercept) ***
## aurka *
## metabric_disc_clin_df$lymph_nodes_positive ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1678 on 992 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.07059, Adjusted R-squared: 0.06872
## F-statistic: 37.67 on 2 and 992 DF, p-value: < 2.2e-16
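The regression shows that both AURKA expression and the number of positive lymph nodes are statistically significant predictors of survival time. Holding the other variable fixed, each additional positive lymph node is associated with about 120 fewer units of survival time (p = 1.36e-15), and each one-unit increase in AURKA expression with about 135 fewer (p = 0.034). The lymph node effect is the more strongly supported of the two and indicates a worse prognosis. The model explains only about 7% of the variance in survival time (R-squared = 0.07), and a linear model does not account for censoring, so these estimates are best treated as hypotheses for further analysis.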
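A Cox proportional hazards model is a common alternative for modeling survival with several covariates, because it uses the event indicator instead of treating every time as an observed death. The sketch below fits one with `coxph()` on simulated data. The variable names `expr` and `nodes`, the effect sizes, and the event coding are all hypothetical stand-ins, not results from METABRIC.

```r
library(survival)

set.seed(7)
n <- 300
# Simulated covariates (not METABRIC data)
toy <- data.frame(
  expr  = rnorm(n, mean = 7, sd = 1),   # hypothetical gene expression
  nodes = rpois(n, lambda = 2)          # hypothetical positive node count
)
# Simulated outcomes: the hazard increases with both covariates
rate <- 0.001 * exp(0.3 * (toy$expr - 7) + 0.2 * toy$nodes)
toy$time  <- rexp(n, rate = rate)
toy$event <- rbinom(n, size = 1, prob = 0.7)   # 1 = death observed, 0 = censored

cox_fit <- coxph(Surv(time, event) ~ expr + nodes, data = toy)
summary(cox_fit)$coefficients   # log hazard ratios, hazard ratios, p-values
exp(coef(cox_fit))              # hazard ratio per unit increase in each covariate
```

A hazard ratio above 1 means the covariate is associated with a higher risk of death at any given time; a ratio below 1 means a lower risk.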