In this activity, we will put our new skills in R to use with a large real-life dataset!
You will load and examine an R data frame that contains clinical information from over 1,000 breast cancer patients from The Cancer Genome Atlas (TCGA).
The Cancer Genome Atlas or TCGA characterized over 20,000 cancer samples spanning 33 cancer types with genomics.
In 2012 and 2018, the TCGA Network reported findings from the analysis of tumor samples from hundreds of breast cancer patients. Throughout this course, we will examine some of the different data types that were used and the computational analyses that were performed.
knitr() is the R package that generates the report from R Markdown. We can create reports as Word doc, PDF, and HTML files.
An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.
We will create an object that holds the name of the directory where the TCGA data resides. This is good R coding practice because we can apply all we do below to a data set in a different directory by changing a single variable (data_dir).
Throughout the course, we will be loading data from R files with the extension .RData. An example is the file brca_clin.RData which we examine below.
.RData files are binary files (i.e., not human readable) that hold R objects, such as vectors, lists, matrices, and data frames.
The R dataframe object in the brca_clin.RData file is the result of previous R computations that put raw data downloaded from the TCGA project into tidy form. Tidy data sets have consistent structure and are easy to manipulate, visualize, and model; the are “clean” and ready for us to use!
We could spend the entire course learning how to put raw data in tidy form, but our goal is to understand and analyze the data, which mostly occurs after tidying the data.
# We `load()` the file "brca_clin.RData".
# The file.path() function tells `load()` where our data resides.
# The argument "verbose = TRUE" makes `load()` tell us what R objects are being loaded.
# The objects will also appear in our "Environment" tab.
load(file.path(data_dir, "brca_clin.RData"),verbose=TRUE)
## Loading objects:
## brca_clin_df
The object was named to be be descriptive.
The Environment tab gives us basic information on the objects that were loaded.
We can use some of the R functions we learned previously to examine the data in more detail.
dim() tells us the dimensions (# row, # columns) of the objects.head() returns the first several rows of the objects.dim(brca_clin_df)
## [1] 1082 27
head(brca_clin_df)
## bcr_patient_barcode gender race ethnicity
## 1 TCGA-3C-AAAU FEMALE WHITE NOT HISPANIC OR LATINO
## 2 TCGA-3C-AALI FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 3 TCGA-3C-AALJ FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 4 TCGA-3C-AALK FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 5 TCGA-4H-AAAK FEMALE WHITE NOT HISPANIC OR LATINO
## 6 TCGA-5L-AAT0 FEMALE WHITE HISPANIC OR LATINO
## age_at_diagnosis year_of_initial_pathologic_diagnosis vital_status
## 1 55 2004 Alive
## 2 50 2003 Alive
## 3 62 2011 Alive
## 4 52 2011 Alive
## 5 50 2013 Alive
## 6 42 2010 Alive
## menopause_status
## 1 Pre (<6 months since LMP AND no prior bilateral ovariectomy AND not on estrogen replacement)
## 2 Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## 3 Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## 4 [Unknown]
## 5 Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## 6 Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## tumor_status margin_status days_to_last_followup prior_dx
## 1 WITH TUMOR Negative 4047 No
## 2 TUMOR FREE Negative 4005 No
## 3 TUMOR FREE Negative 1474 No
## 4 TUMOR FREE Close 1448 No
## 5 TUMOR FREE Negative 348 No
## 6 TUMOR FREE Positive 1477 Yes
## new_tumor_event_after_initial_treatment radiation_therapy
## 1 NO NO
## 2 NO YES
## 3 NO NO
## 4 NO NO
## 5 NO NO
## 6 NO YES
## histological_type pathologic_T pathologic_M pathologic_N
## 1 Infiltrating Lobular Carcinoma TX MX NX
## 2 Infiltrating Ductal Carcinoma T2 M0 N1a
## 3 Infiltrating Ductal Carcinoma T2 M0 N1a
## 4 Infiltrating Ductal Carcinoma T1c M0 N0 (i+)
## 5 Infiltrating Lobular Carcinoma T2 M0 N2a
## 6 Infiltrating Lobular Carcinoma T2 M0 N0
## pathologic_stage_sub pathologic_stage lymph_node_examined_count
## 1 Stage X X 13
## 2 Stage IIB II 15
## 3 Stage IIB II 23
## 4 Stage IA I 2
## 5 Stage IIIA III 14
## 6 Stage IIA II 8
## number_of_lymphnodes_positive initial_diagnosis_method
## 1 4 [Not Available]
## 2 1 Core needle biopsy
## 3 1 Core needle biopsy
## 4 0 Core needle biopsy
## 5 4 Core needle biopsy
## 6 0 Incisional Biopsy
## surgical_procedure estrogen_receptor_status
## 1 Modified Radical Mastectomy Positive
## 2 Lumpectomy Positive
## 3 Modified Radical Mastectomy Positive
## 4 Simple Mastectomy Positive
## 5 Modified Radical Mastectomy Positive
## 6 Modified Radical Mastectomy Positive
## progesterone_receptor_status her2_receptor_status
## 1 Positive Negative
## 2 Positive Positive
## 3 Positive Indeterminate
## 4 Positive Positive
## 5 Positive Equivocal
## 6 Positive Negative
We can combine functions with text to get more informative output:
dim_df <- dim(brca_clin_df) # We should get a result consistent
# with what is in the Environment tab
print(paste("The clinical data frame has",dim_df[1],
"rows and",dim_df[2],"columns."))
## [1] "The clinical data frame has 1082 rows and 27 columns."
# The function paste() allows us to print text and numbers together.
Since this is a relatively small data frame, we can use the View() function to examine it in detail. View() will open the entire object in a new tab.
What are the features in the data frame? We can collect the column names.
colnames(brca_clin_df)
## [1] "bcr_patient_barcode"
## [2] "gender"
## [3] "race"
## [4] "ethnicity"
## [5] "age_at_diagnosis"
## [6] "year_of_initial_pathologic_diagnosis"
## [7] "vital_status"
## [8] "menopause_status"
## [9] "tumor_status"
## [10] "margin_status"
## [11] "days_to_last_followup"
## [12] "prior_dx"
## [13] "new_tumor_event_after_initial_treatment"
## [14] "radiation_therapy"
## [15] "histological_type"
## [16] "pathologic_T"
## [17] "pathologic_M"
## [18] "pathologic_N"
## [19] "pathologic_stage_sub"
## [20] "pathologic_stage"
## [21] "lymph_node_examined_count"
## [22] "number_of_lymphnodes_positive"
## [23] "initial_diagnosis_method"
## [24] "surgical_procedure"
## [25] "estrogen_receptor_status"
## [26] "progesterone_receptor_status"
## [27] "her2_receptor_status"
write.csv(colnames(brca_clin_df),"features.csv")
Wow! That’s a lot of information. But it’s only a fraction of the information available from TCGA.
Examining the data frame in tabular form with View() gives us an idea of the types of information included, but it’s hard to take in, by eye, the values for all 1,082 patients for any given feature, such as “gender”, “vital_status”, or “breast_carcinoma_estrogen_receptor_status”.
For example, it looks like all of the patients are “FEMALE”, but is this true?
# the `table()` function is VERY useful
# we learned how to pull a column by calling its name,
# for example, the column corresponding to "gender" with `$gender`
#table(brca_clin_df[,2])
table(brca_clin_df$gender)
##
## FEMALE MALE
## 1070 12
There are a dozen male patients in this cohort. A Patient Advocate who works with the Cancer System Biology Consortium (CSBC) and the Physical Sciences in Oncology (PS-ON) Program talks about his experience with breast cancer. You can read more about Bob’s story here Bob Riter: The Male Breast Cancer Coalition.
#hist(brca_clin_df$age_at_diagnosis)
hist(brca_clin_df$age_at_diagnosis,
xlab="Age at Diagnosis",
main="Distribution of Ages")
As many measurements in nature, the variable age_at_diagnosis follows a normal distribution (i.e. it is bell curve).
We learned that many breast cancer tumors are estrogen receptor (ER) positive. How many ER+ and ER- tumors are in our dataset?
There’s a data frame column called “estrogen_receptor_status”.
table(brca_clin_df$estrogen_receptor_status)
##
## [Not Evaluated] Indeterminate Negative Positive
## 48 2 236 796
Some samples do not have information on “Negative” (ER-) and “Positive” (ER+) and are either “[Not Evaluated]” or “Indeterminate”. It is usually the case with real-life data that not all records (here, patients) have values for all features or columns in the data frame.
So when working with complex data sets like this, we have to be aware that there may be missing values. We will continually be on the lookout for missing values in our activities!
Let’s look at the status of the progesterone receptor negative (PR). There’s a data frame column called “progesterone_receptor_status”.
table(brca_clin_df$progesterone_receptor_status)
##
## [Not Evaluated] Indeterminate Negative Positive
## 49 4 340 689
According to these summaries, there are 796 ER+ samples and 689 PR+ samples.
How many samples have positive status for both estrogen and progesterone receptors?
In other words, how many samples have both ER+ and PR+?
We want to find the patients (rows in brca_clin_df) with
brca_clin_df\(estrogen_receptor_status == "Positive" AND brca_clin_df\)progesterone_receptor_status == “Positive”.
We use == to find the cases that have the value “Positive”.
The function nrow() will give us the number of rows.
Let’s put the pieces together:
# We apply the conditions to the rows.
ERpos_PRpos <- brca_clin_df[(brca_clin_df$estrogen_receptor_status == "Positive" &
brca_clin_df$progesterone_receptor_status == "Positive"), ]
nrow(ERpos_PRpos)
## [1] 672
We could also do this in one step:
nrow(brca_clin_df[(brca_clin_df$estrogen_receptor_status == "Positive" &
brca_clin_df$progesterone_receptor_status == "Positive"), ])
## [1] 672
Overall, the overlap between the 796 ER+ tumors and 689 PR+ tumors is 672.
HER2 is stands for the Receptor tyrosine-protein kinase erbB-2.
There’s a data frame column called “her2_receptor_status”.
table(brca_clin_df$her2_receptor_status)
##
## [Not Available] [Not Evaluated] Equivocal Indeterminate Negative
## 8 170 177 12 554
## Positive
## 161
In addition to the values we found for ER and PR, HER2 status can also be “Equivocal” and “Indeterminate.” But still, we’re interested mainly in “Positive” and “Negative.”
We also learned that triple negative breast cancer samples are
How many samples have “Negative” status in all three cases?
As we see from the data frame itself and the three code chunks above, the pieces of information we need are in the columns called
Let’s make a dataframe that contains the sample names (“bcr_patient_barcode”) and the three receptor status features.
sample_id <- brca_clin_df$bcr_patient_barcode
er_status <- brca_clin_df$estrogen_receptor_status
pr_status <- brca_clin_df$progesterone_receptor_status
her2_status <- brca_clin_df$her2_receptor_status
receptor_status<- cbind(sample_id,er_status,pr_status,her2_status)
dim(receptor_status)
## [1] 1082 4
head(receptor_status)
## sample_id er_status pr_status her2_status
## [1,] "TCGA-3C-AAAU" "Positive" "Positive" "Negative"
## [2,] "TCGA-3C-AALI" "Positive" "Positive" "Positive"
## [3,] "TCGA-3C-AALJ" "Positive" "Positive" "Indeterminate"
## [4,] "TCGA-3C-AALK" "Positive" "Positive" "Positive"
## [5,] "TCGA-4H-AAAK" "Positive" "Positive" "Equivocal"
## [6,] "TCGA-5L-AAT0" "Positive" "Positive" "Negative"
The dataframe receptor_status contains a subset of the full dataframe brca_clin_df.
From receptor_status, we’ll create another dataframe tnbc that contains only those samples that have “Negative” status for all three receptors.
tnbc <- receptor_status[(receptor_status[,2]=="Negative"&
receptor_status[,3]=="Negative"&
receptor_status[,4]=="Negative"),]
head(tnbc)
## sample_id er_status pr_status her2_status
## [1,] "TCGA-A1-A0SK" "Negative" "Negative" "Negative"
## [2,] "TCGA-A1-A0SP" "Negative" "Negative" "Negative"
## [3,] "TCGA-A2-A04U" "Negative" "Negative" "Negative"
## [4,] "TCGA-A2-A0CM" "Negative" "Negative" "Negative"
## [5,] "TCGA-A2-A0D0" "Negative" "Negative" "Negative"
## [6,] "TCGA-A2-A0D2" "Negative" "Negative" "Negative"
nrow(tnbc)
## [1] 114
So overall, 114 samples out of 1082 total samples correspond to Triple Negative Breast Cancer, the most aggressive type of primary breast cancer. And we know which samples those are because we recorded this information in the object tnbc.
Since we know “sample_id” for the samples with Triple Negative Breast Cancer, we can add additional information about them to our object tnbc.
First, we’ll match the “sample_id” column in tnbc with the “bcr_patient_barcode” column in brca_clin_df.
a <- match(tnbc[,1],brca_clin_df[,1])
a
## [1] 16 20 27 34 47 49 69 71 73 81 98 109 113 114 130
## [16] 139 155 161 162 164 179 186 208 234 249 258 260 282 286 298
## [31] 301 303 305 319 327 337 341 343 356 358 361 362 368 374 378
## [46] 389 391 400 411 418 440 452 503 504 506 518 523 528 530 560
## [61] 592 597 599 613 620 625 628 649 668 674 685 693 694 701 702
## [76] 708 712 714 722 728 747 751 765 767 771 794 798 804 807 813
## [91] 852 859 862 863 932 945 946 953 957 959 965 971 976 980 983
## [106] 986 987 1002 1012 1018 1023 1052 1067 1071
These numbers are the rows that correspond to the samples in tnbc.
Let’s add the “gender” and “vital_status” columns (columns 2 and 7) from brca_clin_df for these samples to tnbc.
tnbc <- cbind(tnbc, brca_clin_df[a,c(2,7)]) # `a` is a vector of the row numbers we want.
# c(2,7) collects the information from columns 2 and 7
head(tnbc)
## sample_id er_status pr_status her2_status gender vital_status
## 16 TCGA-A1-A0SK Negative Negative Negative FEMALE Dead
## 20 TCGA-A1-A0SP Negative Negative Negative FEMALE Alive
## 27 TCGA-A2-A04U Negative Negative Negative FEMALE Alive
## 34 TCGA-A2-A0CM Negative Negative Negative FEMALE Dead
## 47 TCGA-A2-A0D0 Negative Negative Negative FEMALE Alive
## 49 TCGA-A2-A0D2 Negative Negative Negative FEMALE Alive
You can use View() to see the information for all 114 samples in tnbc.
Feel free to modify the code chunks above or to create new code chunks!
Examine other features in brca_clin_df using the table() function and either indexing or specifying the column name.
For example, both commands provide tables of the information in column 16 which contains information on “histological_type”.
table(brca_clin_df[,15])
##
## [Not Available] Infiltrating Carcinoma NOS
## 1 1
## Infiltrating Ductal Carcinoma Infiltrating Lobular Carcinoma
## 774 201
## Medullary Carcinoma Metaplastic Carcinoma
## 6 8
## Mixed Histology (please specify) Mucinous Carcinoma
## 29 17
## Other, specify
## 45
table(brca_clin_df$histological_type)
##
## [Not Available] Infiltrating Carcinoma NOS
## 1 1
## Infiltrating Ductal Carcinoma Infiltrating Lobular Carcinoma
## 774 201
## Medullary Carcinoma Metaplastic Carcinoma
## 6 8
## Mixed Histology (please specify) Mucinous Carcinoma
## 29 17
## Other, specify
## 45
We see that most of the breast cancer samples are either “Infiltrating Ductal Carcinoma” or “Infiltrating Lobular Carcinoma”, the two most common types of breast cancer.
# Replace NN with the column number you're interested in.
table(brca_clin_df[,NN])
# Replace column_name with the name of the column in `brca_clin_df`
table(brca_clin_df$column_name)
Create a data frame for the ER+ and PR+ patient samples. In the code chucks for triple negative breast cancer, we created the variables sample_id, er_status and pr_status, and used the function cbind() to create a data frame object called receptor_status. Take a look at that object again:
dim(receptor_status)
head(receptor_status)
Create a new object hormone_receptor_positive that contains only the samples where er_status and pr_status are “Positive”.
# Replace "Feautre_value" with the value you want.
hormone_receptor_positive<- receptor_status[(receptor_status[,2]=="Feature_value"&
receptor_status[,3]=="Feature_value"),]
head(hormone_receptor_positive)
# Uncomment the `View()` command to see the entire object
# View(hormone_receptor_positive)
Notice the values for her2_status in hormone_receptor_positive.
Finally, try adding additional features to your object hormone_receptor_positive following what we did for the object tnbc we created.
Great work! We have learned a lot about breast cancer by examining data systematically collected from patients. It is great that these patients gave their consent to be part of TCGA. The data is anonymized so that it cannot be traced back to any particular person. TCGA data is publicly available and used by researchers worldwide to improve our ability to diagnose, treat, and prevent cancer.
In the next module, we will be introduced to the gene expression data for these patients.
There are hundreds of lymph nodes throughout the body. Lymph nodes are part of the lymphatic system, which, in turn, is part of the immune system. Lymph nodes are little bean-shaped structures that contain immune cells that help fight infection. They’re connected by a system of lymph vessels that carry fluid throughout your body.
Cancer spreads when cancer cells break away from the primary tumor. In breast cancer, these cells are most likely to reach the lymph nodes closest to the affected breast. Breast cancer has four stages. When lymph nodes are involved, it’s at least stage 2. Metastatic breast cancer is stage 4.
There are some samples where lymph_node_examined_count is “[Not Available]”. Let’s remove those.
LN_counted<-brca_clin_df[((brca_clin_df$lymph_node_examined_count)!="[Not Available]"),]
range(as.numeric(LN_counted$lymph_node_examined_count))
## [1] 0 44
Let’s also remove the samples where 0 nodes where counted.
LN_n<-LN_counted[as.numeric(LN_counted$lymph_node_examined_count)>0,]
range(as.numeric(LN_n$lymph_node_examined_count))
## [1] 1 44
How many samples have information on lymph node count?
nrow(LN_n)/nrow(brca_clin_df) * 100
## [1] 88.44732
88% of the samples have associated lymph node data.
Let’s get a sense of how many lymph nodes were examined in each sample:
hist(as.numeric(LN_n$lymph_node_examined_count),
xlab="Number of lymph nodes examined",
main="Lymph nodes in 88% of patient samples were tested",
breaks=20)
Now, let’s look at the cases where there is is data in LN_n for number_of_lymphnodes_positive.
LN_positive<-LN_n[((LN_n$number_of_lymphnodes_positive)!="[Not Available]"),]
We can examine how many lymph nodes were positive in each case with a scatterplot.
plot(x=as.numeric(LN_positive$lymph_node_examined_count),
y=as.numeric(LN_positive$number_of_lymphnodes_positive),
xlab="Number of lymph nodes counted",
ylab="Number of lymph nodes positive"
)
The more lymph nodes positive, the more serious the cancer.
The data frame LN_n is a subset of brca_clin_df and can be used to look at the stage of disease as a function of the number of positive lymph nodes. The column pathologic_stage provides information on the aggressivenes of the cancer.