About this activity

In this activity, we will put our new skills in R to use with a large real-life dataset!

You will load and examine an R data frame that contains clinical information from over 1,000 breast cancer patients from The Cancer Genome Atlas (TCGA).

The Cancer Genome Atlas or TCGA characterized over 20,000 cancer samples spanning 33 cancer types with genomics.

In 2012 and 2018, the TCGA Network reported findings from the analysis of tumor samples from hundreds of breast cancer patients. Throughout this course, we will examine some of the different data types that were used and the computational analyses that were performed.


Preliminaries

The knitr R package

knitr() is the R package that generates the report from R Markdown. We can create reports as Word doc, PDF, and HTML files.

An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.

The data directory

We will create an object that holds the name of the directory where the TCGA data resides. This is good R coding practice because we can apply all we do below to a data set in a different directory by changing a single variable (data_dir).


Loading the data

Throughout the course, we will be loading data from R files with the extension .RData. An example is the file brca_clin.RData which we examine below.

.RData files are binary files (i.e., not human readable) that hold R objects, such as vectors, lists, matrices, and data frames.

The R dataframe object in the brca_clin.RData file is the result of previous R computations that put raw data downloaded from the TCGA project into tidy form. Tidy data sets have consistent structure and are easy to manipulate, visualize, and model; the are “clean” and ready for us to use!

We could spend the entire course learning how to put raw data in tidy form, but our goal is to understand and analyze the data, which mostly occurs after tidying the data.

# We `load()` the file "brca_clin.RData".
# The file.path() function tells `load()` where our data resides.
# The argument "verbose = TRUE" makes `load()` tell us what R objects are being loaded.
# The objects will also appear in our "Environment" tab.

load(file.path(data_dir, "brca_clin.RData"),verbose=TRUE)   
## Loading objects:
##   brca_clin_df

The object was named to be be descriptive.

  • “brca” stands for “TCGA breast cancer”
  • “clin” is short for “clinical data”
  • “df” stands for “data frame”

Viewing the data

The Environment tab gives us basic information on the objects that were loaded.

We can use some of the R functions we learned previously to examine the data in more detail.

  • dim() tells us the dimensions (# row, # columns) of the objects.
  • head() returns the first several rows of the objects.
  • indexing allows us to look at rows and columns we choose.
dim(brca_clin_df)
## [1] 1082   27
head(brca_clin_df)
##   bcr_patient_barcode gender                      race              ethnicity
## 1        TCGA-3C-AAAU FEMALE                     WHITE NOT HISPANIC OR LATINO
## 2        TCGA-3C-AALI FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 3        TCGA-3C-AALJ FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 4        TCGA-3C-AALK FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 5        TCGA-4H-AAAK FEMALE                     WHITE NOT HISPANIC OR LATINO
## 6        TCGA-5L-AAT0 FEMALE                     WHITE     HISPANIC OR LATINO
##   age_at_diagnosis year_of_initial_pathologic_diagnosis vital_status
## 1               55                                 2004        Alive
## 2               50                                 2003        Alive
## 3               62                                 2011        Alive
## 4               52                                 2011        Alive
## 5               50                                 2013        Alive
## 6               42                                 2010        Alive
##                                                                               menopause_status
## 1 Pre (<6 months since LMP AND no prior bilateral ovariectomy AND not on estrogen replacement)
## 2            Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## 3            Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## 4                                                                                    [Unknown]
## 5            Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## 6            Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
##   tumor_status margin_status days_to_last_followup prior_dx
## 1   WITH TUMOR      Negative                  4047       No
## 2   TUMOR FREE      Negative                  4005       No
## 3   TUMOR FREE      Negative                  1474       No
## 4   TUMOR FREE         Close                  1448       No
## 5   TUMOR FREE      Negative                   348       No
## 6   TUMOR FREE      Positive                  1477      Yes
##   new_tumor_event_after_initial_treatment radiation_therapy
## 1                                      NO                NO
## 2                                      NO               YES
## 3                                      NO                NO
## 4                                      NO                NO
## 5                                      NO                NO
## 6                                      NO               YES
##                histological_type pathologic_T pathologic_M pathologic_N
## 1 Infiltrating Lobular Carcinoma           TX           MX           NX
## 2  Infiltrating Ductal Carcinoma           T2           M0          N1a
## 3  Infiltrating Ductal Carcinoma           T2           M0          N1a
## 4  Infiltrating Ductal Carcinoma          T1c           M0      N0 (i+)
## 5 Infiltrating Lobular Carcinoma           T2           M0          N2a
## 6 Infiltrating Lobular Carcinoma           T2           M0           N0
##   pathologic_stage_sub pathologic_stage lymph_node_examined_count
## 1              Stage X                X                        13
## 2            Stage IIB               II                        15
## 3            Stage IIB               II                        23
## 4             Stage IA                I                         2
## 5           Stage IIIA              III                        14
## 6            Stage IIA               II                         8
##   number_of_lymphnodes_positive initial_diagnosis_method
## 1                             4          [Not Available]
## 2                             1       Core needle biopsy
## 3                             1       Core needle biopsy
## 4                             0       Core needle biopsy
## 5                             4       Core needle biopsy
## 6                             0        Incisional Biopsy
##            surgical_procedure estrogen_receptor_status
## 1 Modified Radical Mastectomy                 Positive
## 2                  Lumpectomy                 Positive
## 3 Modified Radical Mastectomy                 Positive
## 4           Simple Mastectomy                 Positive
## 5 Modified Radical Mastectomy                 Positive
## 6 Modified Radical Mastectomy                 Positive
##   progesterone_receptor_status her2_receptor_status
## 1                     Positive             Negative
## 2                     Positive             Positive
## 3                     Positive        Indeterminate
## 4                     Positive             Positive
## 5                     Positive            Equivocal
## 6                     Positive             Negative

We can combine functions with text to get more informative output:

dim_df <- dim(brca_clin_df)  # We should get a result consistent
                             # with what is in the Environment tab

print(paste("The clinical data frame has",dim_df[1],
              "rows and",dim_df[2],"columns."))
## [1] "The clinical data frame has 1082 rows and 27 columns."
# The function paste() allows us to print text and numbers together.

Since this is a relatively small data frame, we can use the View() function to examine it in detail. View() will open the entire object in a new tab.

What are the features in the data frame? We can collect the column names.

colnames(brca_clin_df)
##  [1] "bcr_patient_barcode"                    
##  [2] "gender"                                 
##  [3] "race"                                   
##  [4] "ethnicity"                              
##  [5] "age_at_diagnosis"                       
##  [6] "year_of_initial_pathologic_diagnosis"   
##  [7] "vital_status"                           
##  [8] "menopause_status"                       
##  [9] "tumor_status"                           
## [10] "margin_status"                          
## [11] "days_to_last_followup"                  
## [12] "prior_dx"                               
## [13] "new_tumor_event_after_initial_treatment"
## [14] "radiation_therapy"                      
## [15] "histological_type"                      
## [16] "pathologic_T"                           
## [17] "pathologic_M"                           
## [18] "pathologic_N"                           
## [19] "pathologic_stage_sub"                   
## [20] "pathologic_stage"                       
## [21] "lymph_node_examined_count"              
## [22] "number_of_lymphnodes_positive"          
## [23] "initial_diagnosis_method"               
## [24] "surgical_procedure"                     
## [25] "estrogen_receptor_status"               
## [26] "progesterone_receptor_status"           
## [27] "her2_receptor_status"
write.csv(colnames(brca_clin_df),"features.csv")

Wow! That’s a lot of information. But it’s only a fraction of the information available from TCGA.


Examining features

Examining the data frame in tabular form with View() gives us an idea of the types of information included, but it’s hard to take in, by eye, the values for all 1,082 patients for any given feature, such as “gender”, “vital_status”, or “breast_carcinoma_estrogen_receptor_status”.

Gender

For example, it looks like all of the patients are “FEMALE”, but is this true?

# the `table()` function is VERY useful
# we learned how to pull a column by calling its name,
# for example, the column corresponding to "gender" with `$gender`

#table(brca_clin_df[,2])
table(brca_clin_df$gender)
## 
## FEMALE   MALE 
##   1070     12

There are a dozen male patients in this cohort. A Patient Advocate who works with the Cancer System Biology Consortium (CSBC) and the Physical Sciences in Oncology (PS-ON) Program talks about his experience with breast cancer. You can read more about Bob’s story here Bob Riter: The Male Breast Cancer Coalition.

Age at Diagnosis

#hist(brca_clin_df$age_at_diagnosis)
hist(brca_clin_df$age_at_diagnosis,
     xlab="Age at Diagnosis",
     main="Distribution of Ages")

As many measurements in nature, the variable age_at_diagnosis follows a normal distribution (i.e. it is bell curve).

Estrogen Receptor status

We learned that many breast cancer tumors are estrogen receptor (ER) positive. How many ER+ and ER- tumors are in our dataset?
There’s a data frame column called “estrogen_receptor_status”.

table(brca_clin_df$estrogen_receptor_status)
## 
## [Not Evaluated]   Indeterminate        Negative        Positive 
##              48               2             236             796

Some samples do not have information on “Negative” (ER-) and “Positive” (ER+) and are either “[Not Evaluated]” or “Indeterminate”. It is usually the case with real-life data that not all records (here, patients) have values for all features or columns in the data frame.

So when working with complex data sets like this, we have to be aware that there may be missing values. We will continually be on the lookout for missing values in our activities!

Progesterone Receptor status

Let’s look at the status of the progesterone receptor negative (PR). There’s a data frame column called “progesterone_receptor_status”.

table(brca_clin_df$progesterone_receptor_status)
## 
## [Not Evaluated]   Indeterminate        Negative        Positive 
##              49               4             340             689

According to these summaries, there are 796 ER+ samples and 689 PR+ samples.
How many samples have positive status for both estrogen and progesterone receptors?
In other words, how many samples have both ER+ and PR+?

We want to find the patients (rows in brca_clin_df) with

brca_clin_df\(estrogen_receptor_status == "Positive" AND brca_clin_df\)progesterone_receptor_status == “Positive”.

We use == to find the cases that have the value “Positive”.

The function nrow() will give us the number of rows.

Let’s put the pieces together:

# We apply the conditions to the rows.

ERpos_PRpos <- brca_clin_df[(brca_clin_df$estrogen_receptor_status == "Positive" &
                   brca_clin_df$progesterone_receptor_status == "Positive"),  ]

nrow(ERpos_PRpos)
## [1] 672

We could also do this in one step:

nrow(brca_clin_df[(brca_clin_df$estrogen_receptor_status == "Positive" &
                   brca_clin_df$progesterone_receptor_status == "Positive"),  ])
## [1] 672

Overall, the overlap between the 796 ER+ tumors and 689 PR+ tumors is 672.

HER2 status

HER2 is stands for the Receptor tyrosine-protein kinase erbB-2.

There’s a data frame column called “her2_receptor_status”.

table(brca_clin_df$her2_receptor_status)
## 
## [Not Available] [Not Evaluated]       Equivocal   Indeterminate        Negative 
##               8             170             177              12             554 
##        Positive 
##             161

In addition to the values we found for ER and PR, HER2 status can also be “Equivocal” and “Indeterminate.” But still, we’re interested mainly in “Positive” and “Negative.”


Triple Negative Breast Cancer

We also learned that triple negative breast cancer samples are

  • estrogen receptor negative (ER-)
  • progesterone receptor negative (PR-)
  • HER2 negative (HER2-).

How many samples have “Negative” status in all three cases?

As we see from the data frame itself and the three code chunks above, the pieces of information we need are in the columns called

  • estrogen_receptor_status
  • progesterone_receptor_status
  • her2_receptor_status

Let’s make a dataframe that contains the sample names (“bcr_patient_barcode”) and the three receptor status features.

sample_id <- brca_clin_df$bcr_patient_barcode
er_status <- brca_clin_df$estrogen_receptor_status
pr_status <- brca_clin_df$progesterone_receptor_status
her2_status <- brca_clin_df$her2_receptor_status

receptor_status<- cbind(sample_id,er_status,pr_status,her2_status)

dim(receptor_status)
## [1] 1082    4
head(receptor_status)
##      sample_id      er_status  pr_status  her2_status    
## [1,] "TCGA-3C-AAAU" "Positive" "Positive" "Negative"     
## [2,] "TCGA-3C-AALI" "Positive" "Positive" "Positive"     
## [3,] "TCGA-3C-AALJ" "Positive" "Positive" "Indeterminate"
## [4,] "TCGA-3C-AALK" "Positive" "Positive" "Positive"     
## [5,] "TCGA-4H-AAAK" "Positive" "Positive" "Equivocal"    
## [6,] "TCGA-5L-AAT0" "Positive" "Positive" "Negative"

The dataframe receptor_status contains a subset of the full dataframe brca_clin_df.

From receptor_status, we’ll create another dataframe tnbc that contains only those samples that have “Negative” status for all three receptors.

tnbc <- receptor_status[(receptor_status[,2]=="Negative"&
                        receptor_status[,3]=="Negative"&
                        receptor_status[,4]=="Negative"),]

head(tnbc)
##      sample_id      er_status  pr_status  her2_status
## [1,] "TCGA-A1-A0SK" "Negative" "Negative" "Negative" 
## [2,] "TCGA-A1-A0SP" "Negative" "Negative" "Negative" 
## [3,] "TCGA-A2-A04U" "Negative" "Negative" "Negative" 
## [4,] "TCGA-A2-A0CM" "Negative" "Negative" "Negative" 
## [5,] "TCGA-A2-A0D0" "Negative" "Negative" "Negative" 
## [6,] "TCGA-A2-A0D2" "Negative" "Negative" "Negative"
nrow(tnbc)
## [1] 114

So overall, 114 samples out of 1082 total samples correspond to Triple Negative Breast Cancer, the most aggressive type of primary breast cancer. And we know which samples those are because we recorded this information in the object tnbc.

Since we know “sample_id” for the samples with Triple Negative Breast Cancer, we can add additional information about them to our object tnbc.

First, we’ll match the “sample_id” column in tnbc with the “bcr_patient_barcode” column in brca_clin_df.

a <- match(tnbc[,1],brca_clin_df[,1])
a
##   [1]   16   20   27   34   47   49   69   71   73   81   98  109  113  114  130
##  [16]  139  155  161  162  164  179  186  208  234  249  258  260  282  286  298
##  [31]  301  303  305  319  327  337  341  343  356  358  361  362  368  374  378
##  [46]  389  391  400  411  418  440  452  503  504  506  518  523  528  530  560
##  [61]  592  597  599  613  620  625  628  649  668  674  685  693  694  701  702
##  [76]  708  712  714  722  728  747  751  765  767  771  794  798  804  807  813
##  [91]  852  859  862  863  932  945  946  953  957  959  965  971  976  980  983
## [106]  986  987 1002 1012 1018 1023 1052 1067 1071

These numbers are the rows that correspond to the samples in tnbc.

Let’s add the “gender” and “vital_status” columns (columns 2 and 7) from brca_clin_df for these samples to tnbc.

tnbc <- cbind(tnbc, brca_clin_df[a,c(2,7)])   # `a` is a vector of the row numbers we want.
                                              # c(2,7) collects the information from columns 2 and 7
head(tnbc)
##       sample_id er_status pr_status her2_status gender vital_status
## 16 TCGA-A1-A0SK  Negative  Negative    Negative FEMALE         Dead
## 20 TCGA-A1-A0SP  Negative  Negative    Negative FEMALE        Alive
## 27 TCGA-A2-A04U  Negative  Negative    Negative FEMALE        Alive
## 34 TCGA-A2-A0CM  Negative  Negative    Negative FEMALE         Dead
## 47 TCGA-A2-A0D0  Negative  Negative    Negative FEMALE        Alive
## 49 TCGA-A2-A0D2  Negative  Negative    Negative FEMALE        Alive

You can use View() to see the information for all 114 samples in tnbc.


Practice

Feel free to modify the code chunks above or to create new code chunks!

Practice 1

Examine other features in brca_clin_df using the table() function and either indexing or specifying the column name.

For example, both commands provide tables of the information in column 16 which contains information on “histological_type”.

table(brca_clin_df[,15])
## 
##                  [Not Available]       Infiltrating Carcinoma NOS 
##                                1                                1 
##    Infiltrating Ductal Carcinoma   Infiltrating Lobular Carcinoma 
##                              774                              201 
##              Medullary Carcinoma            Metaplastic Carcinoma 
##                                6                                8 
## Mixed Histology (please specify)               Mucinous Carcinoma 
##                               29                               17 
##                   Other, specify 
##                               45
table(brca_clin_df$histological_type)
## 
##                  [Not Available]       Infiltrating Carcinoma NOS 
##                                1                                1 
##    Infiltrating Ductal Carcinoma   Infiltrating Lobular Carcinoma 
##                              774                              201 
##              Medullary Carcinoma            Metaplastic Carcinoma 
##                                6                                8 
## Mixed Histology (please specify)               Mucinous Carcinoma 
##                               29                               17 
##                   Other, specify 
##                               45

We see that most of the breast cancer samples are either “Infiltrating Ductal Carcinoma” or “Infiltrating Lobular Carcinoma”, the two most common types of breast cancer.

# Replace NN with the column number you're interested in.
table(brca_clin_df[,NN])

# Replace column_name with the name of the column in `brca_clin_df`
table(brca_clin_df$column_name)

Practice 2

Create a data frame for the ER+ and PR+ patient samples. In the code chucks for triple negative breast cancer, we created the variables sample_id, er_status and pr_status, and used the function cbind() to create a data frame object called receptor_status. Take a look at that object again:

dim(receptor_status)

head(receptor_status)

Create a new object hormone_receptor_positive that contains only the samples where er_status and pr_status are “Positive”.

# Replace "Feautre_value" with the value you want.

hormone_receptor_positive<- receptor_status[(receptor_status[,2]=="Feature_value"&
                                             receptor_status[,3]=="Feature_value"),]
                        
head(hormone_receptor_positive)

# Uncomment the `View()` command to see the entire object
# View(hormone_receptor_positive)

Notice the values for her2_status in hormone_receptor_positive.

Finally, try adding additional features to your object hormone_receptor_positive following what we did for the object tnbc we created.

Great work! We have learned a lot about breast cancer by examining data systematically collected from patients. It is great that these patients gave their consent to be part of TCGA. The data is anonymized so that it cannot be traced back to any particular person. TCGA data is publicly available and used by researchers worldwide to improve our ability to diagnose, treat, and prevent cancer.

In the next module, we will be introduced to the gene expression data for these patients.

Bonus examinations

There are hundreds of lymph nodes throughout the body. Lymph nodes are part of the lymphatic system, which, in turn, is part of the immune system. Lymph nodes are little bean-shaped structures that contain immune cells that help fight infection. They’re connected by a system of lymph vessels that carry fluid throughout your body.

Cancer spreads when cancer cells break away from the primary tumor. In breast cancer, these cells are most likely to reach the lymph nodes closest to the affected breast. Breast cancer has four stages. When lymph nodes are involved, it’s at least stage 2. Metastatic breast cancer is stage 4.

There are some samples where lymph_node_examined_count is “[Not Available]”. Let’s remove those.

LN_counted<-brca_clin_df[((brca_clin_df$lymph_node_examined_count)!="[Not Available]"),]

range(as.numeric(LN_counted$lymph_node_examined_count))
## [1]  0 44

Let’s also remove the samples where 0 nodes where counted.

LN_n<-LN_counted[as.numeric(LN_counted$lymph_node_examined_count)>0,]
range(as.numeric(LN_n$lymph_node_examined_count))
## [1]  1 44

How many samples have information on lymph node count?

nrow(LN_n)/nrow(brca_clin_df) * 100
## [1] 88.44732

88% of the samples have associated lymph node data.

Let’s get a sense of how many lymph nodes were examined in each sample:

hist(as.numeric(LN_n$lymph_node_examined_count),
     xlab="Number of lymph nodes examined",
     main="Lymph nodes in 88% of patient samples were tested",
     breaks=20)

Now, let’s look at the cases where there is is data in LN_n for number_of_lymphnodes_positive.

LN_positive<-LN_n[((LN_n$number_of_lymphnodes_positive)!="[Not Available]"),]

We can examine how many lymph nodes were positive in each case with a scatterplot.

plot(x=as.numeric(LN_positive$lymph_node_examined_count),
     y=as.numeric(LN_positive$number_of_lymphnodes_positive),
     xlab="Number of lymph nodes counted",
     ylab="Number of lymph nodes positive"
)

The more lymph nodes positive, the more serious the cancer.

The data frame LN_n is a subset of brca_clin_df and can be used to look at the stage of disease as a function of the number of positive lymph nodes. The column pathologic_stage provides information on the aggressivenes of the cancer.