We will load and examine R dataframe objects that contain data from over 1,000 breast cancer (BRCA) patients from The Cancer Genome Atlas (TCGA).
The objects include:
clinical measurements on the patients and the patients’ tumors, such as gender, age, estrogen, progesterone, and her2 receptor status. We examined this data in detail in our previous activity, Breast_Cancer_Patient_Data.Rmd.
gene expression data which tells us how many messenger RNAs (mRNAs) per gene are present in a patient sample. The amount of a gene’s mRNA corresponds (roughly) to the amount of protein in the sample.
We create an object that holds the name of the directory where the TCGA data resides (data_dir).
Like last time, we will load data from R files with the extension .RData. .RData files are binary files (i.e., not human readable) that can store multiple R objects, such as vectors, lists, matrices, and data frames.
# We `load()` TCGA_brca.RData
# The file.path() function tells `load()` where our data resides
# The objects will also appear in our "Environment" tab.
load(file.path(data_dir, "TCGA_brca.RData"),verbose=TRUE)
## Loading objects:
## brca_clin_df
## brca_expr_df
During our last meeting, we spent some time exploring the clinical dataframe, brca_clin_df. Let’s review its properties.
# View(brca_clin_df)
Another way to look at the values of a feature is a histogram.
hist(brca_clin_df$age_at_diagnosis,
xlab="Age at Diagnosis",
main="Distribution of Ages")
Another important source of information in The Cancer Genome Atlas (TCGA) breast cancer set is the gene expression data.
dim_df <- dim(brca_expr_df) # Check the Environment tab, too!
print(paste("The gene expression data frame has",dim_df[1],
"rows and",dim_df[2],"columns."))
## [1] "The gene expression data frame has 18351 rows and 1083 columns."
# The function `paste()` allows us to `print()` text and numbers together.
Let’s see what’s in the rows and columns using indexing to get
brca_expr_df[1:10,1:5]
## symbol TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK
## 1 TSPAN6 188.18 207.7 1005 1104.7
## 2 TNMD 0.34 1.1 40 1.2
## 3 DPM1 524.23 809.1 967 463.4
## 4 SCYL3 325.14 1558.9 336 549.3
## 5 C1orf112 124.29 274.7 242 218.2
## 6 FGR 75.14 182.7 284 144.0
## 7 CFH 348.11 1119.6 1031 1262.3
## 8 FUCA2 563.17 1099.5 872 862.6
## 9 GCLC 578.68 903.2 633 602.0
## 10 NFYA 1408.62 756.9 730 1150.6
The first column, “symbol”, provides the names of genes whose mRNA expression levels were measured in the experiments.
The remaining columns are the expression levels of the messenger RNAs (mRNAs) from different patient samples. The names of the columns are anonymized indentifiers for patient samples.
Remember that the clinical data is organized so that patient samples are the rows of the data frame:
brca_clin_df[1:5,1:10]
## bcr_patient_barcode gender race ethnicity
## 1 TCGA-3C-AAAU FEMALE WHITE NOT HISPANIC OR LATINO
## 2 TCGA-3C-AALI FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 3 TCGA-3C-AALJ FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 4 TCGA-3C-AALK FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 5 TCGA-4H-AAAK FEMALE WHITE NOT HISPANIC OR LATINO
## age_at_diagnosis year_of_initial_pathologic_diagnosis vital_status
## 1 55 2004 Alive
## 2 50 2003 Alive
## 3 62 2011 Alive
## 4 52 2011 Alive
## 5 50 2013 Alive
## menopause_status
## 1 Pre (<6 months since LMP AND no prior bilateral ovariectomy AND not on estrogen replacement)
## 2 Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## 3 Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## 4 [Unknown]
## 5 Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## tumor_status margin_status
## 1 WITH TUMOR Negative
## 2 TUMOR FREE Negative
## 3 TUMOR FREE Negative
## 4 TUMOR FREE Close
## 5 TUMOR FREE Negative
Notice that both the expression and clinical dataframes have information for the same patient samples: TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK
In the next Module, we will combine clinical and expression data for each patient.
identical(colnames(brca_expr_df)[-1],brca_clin_df[,1])
## [1] TRUE
To do analysis with the expression data frame, we have to create a matrix that contains only numerical data.
The first column of brca_expr_df contains gene names, which are text characters.
typeof(brca_expr_df[,1])
## [1] "character"
The remaining columns are numeric:
typeof(brca_expr_df[,2])
## [1] "double"
“double” stands for “double-precision floating-point format.” In other words, a decimal number.
We want to create a matrix that contains only the numerical gene expression data. But we want to keep the information from the “symbol” column so we know what genes (mRNAs) the expression levels refer to.
# We remove the first column ("symbol"),
# use the function as.matrix().
# and assign the result to a new object ("brca_expr_mat").
# Indexing with -1 removes the first column
brca_expr_mat <- as.matrix(brca_expr_df[,-1])
# We want the row names to come from the symbol column,
# so we'll use the function row.names().
row.names(brca_expr_mat) <- brca_expr_df$symbol
# check the result
brca_expr_mat[1:5, 1:5]
## TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## TSPAN6 188.18 207.7 1005 1104.7 942.6
## TNMD 0.34 1.1 40 1.2 4.7
## DPM1 524.23 809.1 967 463.4 486.4
## SCYL3 325.14 1558.9 336 549.3 416.3
## C1orf112 124.29 274.7 242 218.2 179.5
# View(as.numeric(brca_expr_mat[1:100, 1:10]))
Let’s look at the average value of the expression for the first gene (TSPAN6) across the patient samples.
# The function `mean()` takes the average of a list of numbers
# We are looking at the first row of values
mean(brca_expr_mat[1,])
## [1] 996
mean(brca_expr_mat["TSPAN6",])
## [1] 996
What is the average expression of the second gene (TNMD)?
# The function `mean()` takes the average of a list of numbers
mean(brca_expr_mat[2,])
## [1] 24
mean(brca_expr_mat["TNMD",])
## [1] 24
The average of the expression is much smaller for the second gene (23.7) than the first gene (996.2). Does this mean anything??
The gene TSPAN6 codes for the protein Tetraspanin-6. This protein helps signal events that play a role in cell growth and motility, and its higher level suggests motility of breast cancer cells may be important.
The gene TNMD codes for the protein Tenomodulin, which is important for the formation of tendons. TNMD is highly expressed in tendons, but lowly expressed in other parts of the body. Its lower level suggests it does not play a role in breast cancer.
It would be tedious to examine each of the 18,351 genes in the expression matrix!!
We can use the function apply() to apply a function like mean() to all rows or columns of our matrix at once.
# We "apply" the function mean() to all rows.
# The first argument is the matrix.
# The second argument is 1 for "rows".
# The third argument is the function name, `mean()`
mean.expr <- apply(brca_expr_mat, 1, mean)
mean.expr[1:25]
## TSPAN6 TNMD DPM1 SCYL3 C1orf112 FGR CFH FUCA2
## 996 24 699 624 320 187 1246 1432
## GCLC NFYA STPG1 NIPAL3 LAS1L ENPP4 SEMA3F CFTR
## 869 1326 215 985 1143 642 2180 19
## ANKIB1 CYP51A1 KRIT1 RAD52 BAD LAP3 CD99 HS3ST1
## 1614 1945 738 130 853 2646 2751 91
## AOC1
## 84
Now that we have the mean expression for all 18K genes, how else may we use this information?
Let’s try plotting the mean expression values (we saved these values in the object mean.expr) as a histogram.
hist(mean.expr,
main="Distribution of Gene Expression Values",
xlab="Mean Expression")
Wow! Most of the average expression values are relatively small but it’s hard to see that range because of a few VERY large values.
How many values in mean.expr are greater than 10000? We can find this out by choosing only those values in mean.expr that are greater than 10000.
# This is a logical expression
greater_than_10000 <- (mean.expr > 10000)
greater_than_10000[1:10]
## TSPAN6 TNMD DPM1 SCYL3 C1orf112 FGR CFH FUCA2
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## GCLC NFYA
## FALSE FALSE
When we use an expression like “mean.expr > 10000” we get a vector that tells us if the condition is TRUE or FALSE for that gene. Note that most genes do NOT have expression level greater than 10000.
We can use the logical condition to create an objective that has only genes with expression level > 10000.
# Create an object for expression values > 10,000
mean.expr.high <- mean.expr[mean.expr > 10000]
length(mean.expr.high)
## [1] 201
Out of 18,351 mean values, only 201 are greater than 10000!
Let’s zoom in on the average expression values less than 10000 with the logical condition “mean.expr < 10000”.
# Let's look at the "distribution" of mean expression values
# that are less than 10000.
mean.expr.low <- mean.expr[mean.expr < 10000]
hist(mean.expr.low,
breaks=50,
main="Distribution of Gene Expression Values",
xlab="Mean Expression")
By “zooming” in, we see that most genes actually have average expression < 2000.
There are only a few genes that have really large average expression (201 out of 18351 genes). The curve in the histogram has a long right tail.
Nearly all genes have relatively low expression, and very few genes have high expression.
In other words, our data is highly skewed.
Many functions in R assume the data are normally distributed, which means their histogram is expected to be bell-shaped.
For example, in the clinical data frame we saw that the ages of the patients follows a normal distribution.
hist(brca_clin_df$age_at_diagnosis,
xlab="Age at Diagnosis",
main="Distribution of Ages")
But this is not the situation with our expression data!
The log transformation can be used to make highly skewed distributions less skewed and sometimes normally distributed. And this can be valuable both for
Expression data analysis is usually done with log base 2.
Log base 2 of 8 is 3 because 2 * 2 * 2 = 8.
Log base 2 of 4 is 2 because 2 * 2 = 4.
Log base 2 of 2 is 1 because 2 = 2.
So the values (8, 4, 2) are log transformed to (3, 2, 1).
Let’s consider some larger numbers:
a <- c(4096, 1024, 512, 64)
b <- log(a,2)
plot(a,b,pch=19,col="red")
Notice how the scale for b (vertical axis) is much more compressed than for a (horizontal axis).
There is one caveat. The log2 of 0 is infinity!
log(0,2)
## [1] -Inf
But the log2 of 1 is “well-behaved”.
log(1,2)
## [1] 0
So before taking the log2 of our expression data, we will add 1 to each value to avoid the occurence of “-Inf” values. This is standard practice and, as evidenced from the histogram of expression values, it is a small change.
# Add 1 to each element, then take the log base 2.
brca_expr_mat.log <- log(brca_expr_mat+1, 2)
# Look at some values:
brca_expr_mat.log[1:5,1:5]
## TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## TSPAN6 7.56 7.7 10.0 10.1 9.9
## TNMD 0.43 1.1 5.4 1.2 2.5
## DPM1 9.04 9.7 9.9 8.9 8.9
## SCYL3 8.35 10.6 8.4 9.1 8.7
## C1orf112 6.97 8.1 7.9 7.8 7.5
Let’s check out what the curve is for the log-transformed expression values:
mean.expr <- apply(brca_expr_mat.log, 1, mean)
hist(mean.expr, breaks=100, main="Distribution of Log_2 Gene Expression Values",
xlab="Mean Log_2 Expression")
The log-transformed data is less scewed and even has some bell-shaped character.
While this isn’t a true normal distribution of data (especially at the lower range), it is much closer, and importantly, close enough for some exploratory analyses.
In our future analyses, we will be working with the log-transformed data matrix brca_expr_mat.log.
Learn about genes related to breast cancer at this site: https://www.nationalbreastcancer.org/other-breast-cancer-genes and look at their average values across patients.
mean(brca_expr_mat["FGFR2",])
## [1] 1363
mean(brca_expr_mat["MAP3K1",])
## [1] 2014
FGFR2 and MAP3K1 increase the risk of breast cancer and often carry mutations. You can learn more about their functions at the UniProt Knowledgebase https://www.uniprot.org/uniprotkb.
We learned we can get the average expression using the names of the genes themselves.
BRCA1 encodes for the breast cancer susceptibility gene which is often mutated in cancer.
# use indexing to get mean expression for BRCA1 and CDH1
brca1_expr<-mean(brca_expr_mat[ , ])
cdh1_expr<-mean(brca_expr_mat[ , ])
Provide a hypothesis to explain the difference in expression of BRCA1 versus CDH1.
brca1_vs_cdh1<-" place text between quotes "
Let’s save our work as an html file!
knitr() is the R package that generates reports from R Markdown. R Markdown is the syntax we are using in this document, denoted by the .Rmd file type. With knitr() and R Markdown, we can create reports of our work in the form of Word doc, PDF, and HTML files.
An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.
Click the “Knit” button at the top of this window and select “Knit to html.”
Congratulations on completing another activity!
Now that we’ve gained some experience working with expression data, we can move onto the next activity and look for interesting patterns.
Practice 1 What features does the data frame contain?
# replace `function` with the function
# that will return the column names of the data frame
colnames(brca_clin_df)
## [1] "bcr_patient_barcode"
## [2] "gender"
## [3] "race"
## [4] "ethnicity"
## [5] "age_at_diagnosis"
## [6] "year_of_initial_pathologic_diagnosis"
## [7] "vital_status"
## [8] "menopause_status"
## [9] "tumor_status"
## [10] "margin_status"
## [11] "days_to_last_followup"
## [12] "prior_dx"
## [13] "new_tumor_event_after_initial_treatment"
## [14] "radiation_therapy"
## [15] "histological_type"
## [16] "pathologic_T"
## [17] "pathologic_M"
## [18] "pathologic_N"
## [19] "pathologic_stage_sub"
## [20] "pathologic_stage"
## [21] "lymph_node_examined_count"
## [22] "number_of_lymphnodes_positive"
## [23] "initial_diagnosis_method"
## [24] "surgical_procedure"
## [25] "estrogen_receptor_status"
## [26] "progesterone_receptor_status"
## [27] "her2_receptor_status"
Practice 2
The function unique() will provide the unique values in a list.
First, how many total values are in the column for estrogen receptor (ER) status?
length(brca_clin_df$estrogen_receptor_status)
## [1] 1082
Then, what are the unique values in the estrogen receptor (ER) status column?
unique(brca_clin_df$estrogen_receptor_status)
## [1] "Positive" "Negative" "[Not Evaluated]" "Indeterminate"
Finally, note that we get both the unique values and the number of times they occur with the table() function we used last time.
table(brca_clin_df$estrogen_receptor_status)
##
## [Not Evaluated] Indeterminate Negative Positive
## 48 2 236 796
Practice 3: Find the unique values of other features.
# type the $ symbol and select a feature (column) to find its unique values
unique(brca_clin_df$prior_dx )
Practice 5
Combine code chunks with text to provide more comprehensible results:
# mean() takes the average of a set of values
avg_TSPAN6_expr <- mean(brca_expr_mat[1,]) # We are taking the average of expression values
# for the first row of the matrix
avg_TNMD_expr <- mean(brca_expr_mat[2,]) # We are taking the average of expression values
# for the second row of the matrix
print(paste("Average expression of TSPAN6: ", round(avg_TSPAN6_expr,0)))
print(paste("Average expression of TNMD: ", round(avg_TNMD_expr,0)))
# The function paste() allows us to print text and numbers together.
# The function round() lets us choose how many decimal digits we want to show.