In this activity, we will learn new skills in R with a large real-life dataset!

We will load and examine a data frame that contains clinical information from over 1,000 breast cancer patients from The Cancer Genome Atlas (TCGA).

TCGA characterized over 20,000 cancer samples spanning 33 cancer types with genomics.

Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. Learn more here!.

Throughout this course, we will examine some of the different data types and the computational analyses that were performed to decipher breast cancer data from TCGA.


Preliminaries

Preliminaries are the tasks we need to do before we begin our analysis.

Create your own version!

Go to the File tab and use Save as... to create your own copy of this activity. Why? Because we inevitably make mistakes, and it’s great to have the original as reference or to start over!

Include yourself as an author

This is your workbook and you will be making changes, answering questions, and doing your own analyses. Add you name after the second hyphen in under author in the header of this file. You can change the date as well!

Load packages

Many users and developers contribute packages to R that contain useful functionality not available with the default R distribution. At the beginning of our activities, we load the special packages we need with the library() function.

Click the green arrow in the top right corner of the code chunk to run the code

The data directory

We will create an object that holds the name of the directory where our TCGA data resides. This is good programming practice: We can examine data in a different location by changing the name of the data directory once at the beginning of our code.

# The data we are using is located in /home/shared/data on our cloud server.
data_dir <- "/home/data"     

# Look in your Environment tab for the `data_dir` object!

Breast Cancer Basics

Breast cancer most often begins in the ducts.

# Note that objects in R can be images.
# Images are matrices of pixel values.

# The `file.path()` function tells `load.image()` where our image resides.

anatomy_img<-load.image(file.path(data_dir,"breast_anatomy.jpg"))

# `par()` defines parameters for plotting
# The parameter `mar` accepts four values as margins for the image
par(mar=c(2,2,2,2))

plot(anatomy_img,axes=FALSE)

In breast cancer, the cells lining the ducts grow and divide abnormally.

duct_cells_img<-load.image(file.path(data_dir,"duct_cells.jpg"))

par(font.sub=4, mar=c(2,2,2,2))

plot(duct_cells_img,axes=FALSE)

We will examine patient data for these breast cancer ductal types.

Metastasis involves the development of secondary malignant growths at a distance from a primary site of cancer.

# The `file.path()` function tells `load.image()` where our image resides.
meta_img<-load.image(file.path(data_dir,"metastasis.jpg"))

par(font.sub=4, mar=c(2,2,2,2))

plot(meta_img,axes=FALSE)


Loading data

Throughout the course, we will be loading data from R files with the extension .RData. An example is the file brca_clin.RData which we will work with in this activity.

.RData files are binary files (i.e., not human readable) that hold R objects, such as data frames.

# The `load()` function reads in the data file "brca_clin.RData".
# The `file.path()` tells `load()` where our data resides.
# The argument "verbose = TRUE" makes `load()` tell us what R objects are being loaded.
# The objects will also appear in our "Environment" tab.

load(file.path(data_dir, "brca_clin.RData"),verbose=TRUE)   
## Loading objects:
##   brca_clin_df

The object brca_clin_df was named to be be descriptive.

  • “brca” stands for “TCGA breast cancer”
  • “clin” is short for “clinical data”
  • “df” stands for “data frame”

Data frames are data displayed as a table.

In data science, columns in a data frame are features and rows are records or observations.


Viewing the data

The Environment tab gives us basic information on the objects that were loaded.

There are 1082 observations (patients) with 27 variables (clinical features).

We can use some simple R functions to examine the data in more detail.

  • dim() tells us the dimensions (# rows, # columns) of the object.
  • head() returns the first several rows of the objects.
  • indexing allows us to look at rows and columns we choose.
# How many rows and columns are in our object?
dim(brca_clin_df)
## [1] 1082   27

Is the output what we expect? Check your Environment tab.

# Let's take a first look at the data in our object
# `head` gives us the first six rows and ALL of the columns
head(brca_clin_df)

Scroll through the columns by clicking on the arrow in the top right corner of the table.

We can see the entire data frame more conveniently in a new tab:

View(brca_clin_df)

Notice the column names. Let’s define a few:

  • bcr_patient_barcode provides anonymous identifiers for the patients.
  • histological_type describes the ductal invasion.
  • estrogen_receptor_status designates whether the tumor is ER+ or ER-.

We can use indexing to view specific elements of the data frame:

# Look at a single element
# The first number denotes row, the second number denotes column
# The first feature for the patient in row 1
brca_clin_df[1,1]
## [1] "TCGA-3C-AAAU"
# The function`c()` combines values into a list
# The colon operator `:` creates a sequence of numbers
c(1:3,8,15)
## [1]  1  2  3  8 15

Let’s look at the last three features for a group of patients.

brca_clin_df[c(1:3,8,15),25:27]
##    estrogen_receptor_status progesterone_receptor_status her2_receptor_status
## 1                  Positive                     Positive             Negative
## 2                  Positive                     Positive             Positive
## 3                  Positive                     Positive        Indeterminate
## 8                  Positive                     Negative             Negative
## 15                 Positive                     Positive            Equivocal

Practice

Practice 2: Look at a single element

# Look at a single element
# The first number denotes row, the second number denotes column
brca_clin_df[100,15]
## [1] "Infiltrating Lobular Carcinoma"

Practice 3: Look at the data for a single patient.

# Look at a single row or patient
# `t()` transposes the data 
t(brca_clin_df[1,])
##                                         1                                                                                             
## bcr_patient_barcode                     "TCGA-3C-AAAU"                                                                                
## gender                                  "FEMALE"                                                                                      
## race                                    "WHITE"                                                                                       
## ethnicity                               "NOT HISPANIC OR LATINO"                                                                      
## age_at_diagnosis                        "55"                                                                                          
## year_of_initial_pathologic_diagnosis    "2004"                                                                                        
## vital_status                            "Alive"                                                                                       
## menopause_status                        "Pre (<6 months since LMP AND no prior bilateral ovariectomy AND not on estrogen replacement)"
## tumor_status                            "WITH TUMOR"                                                                                  
## margin_status                           "Negative"                                                                                    
## days_to_last_followup                   "4047"                                                                                        
## prior_dx                                "No"                                                                                          
## new_tumor_event_after_initial_treatment "NO"                                                                                          
## radiation_therapy                       "NO"                                                                                          
## histological_type                       "Infiltrating Lobular Carcinoma"                                                              
## pathologic_T                            "TX"                                                                                          
## pathologic_M                            "MX"                                                                                          
## pathologic_N                            "NX"                                                                                          
## pathologic_stage_sub                    "Stage X"                                                                                     
## pathologic_stage                        "X"                                                                                           
## lymph_node_examined_count               "13"                                                                                          
## number_of_lymphnodes_positive           "4"                                                                                           
## initial_diagnosis_method                "[Not Available]"                                                                             
## surgical_procedure                      "Modified Radical Mastectomy"                                                                 
## estrogen_receptor_status                "Positive"                                                                                    
## progesterone_receptor_status            "Positive"                                                                                    
## her2_receptor_status                    "Negative"

Note that the output data are identified with the column names (features) of our data frame. Can you guess what some of the features correspond to?

Google some of the terms, e.g. “pathologic stage” and “initial diagnosis method.” This is a great way to learn more about breast cancer.


Practice 4: Look at ranges of values using the the combine function c() and the colon operator.

# Examine a set of features for a group of patients
brca_clin_df[c(20,30,40),1:5]
##    bcr_patient_barcode gender            race              ethnicity
## 20        TCGA-A1-A0SP FEMALE [Not Available] NOT HISPANIC OR LATINO
## 30        TCGA-A2-A04X FEMALE           WHITE        [Not Available]
## 40        TCGA-A2-A0CT FEMALE           WHITE NOT HISPANIC OR LATINO
##    age_at_diagnosis
## 20               40
## 30               34
## 40               71

Practice 5: More meaningful output

Readability: We can combine functions with text to get more informative output.

Remember that the dim() function provides two numbers:

  • The number of rows
  • The number of columns
# Assign the dimensions of the data frame to an object.

dim_df <- dim(brca_clin_df)  

# Since `dim_df` has two values, we can choose each separately with indexing.
# The function paste() allows us to print text and numbers together.

print(paste("The clinical data frame has",dim_df[1],
            "rows and",dim_df[2],"columns."))
## [1] "The clinical data frame has 1082 rows and 27 columns."

Practice 6: Writing files

What are all of the features in the data frame? We can collect the column names.

# Many function names in R are intuitive.
# colnames gives the names of the columns.

colnames(brca_clin_df)
##  [1] "bcr_patient_barcode"                    
##  [2] "gender"                                 
##  [3] "race"                                   
##  [4] "ethnicity"                              
##  [5] "age_at_diagnosis"                       
##  [6] "year_of_initial_pathologic_diagnosis"   
##  [7] "vital_status"                           
##  [8] "menopause_status"                       
##  [9] "tumor_status"                           
## [10] "margin_status"                          
## [11] "days_to_last_followup"                  
## [12] "prior_dx"                               
## [13] "new_tumor_event_after_initial_treatment"
## [14] "radiation_therapy"                      
## [15] "histological_type"                      
## [16] "pathologic_T"                           
## [17] "pathologic_M"                           
## [18] "pathologic_N"                           
## [19] "pathologic_stage_sub"                   
## [20] "pathologic_stage"                       
## [21] "lymph_node_examined_count"              
## [22] "number_of_lymphnodes_positive"          
## [23] "initial_diagnosis_method"               
## [24] "surgical_procedure"                     
## [25] "estrogen_receptor_status"               
## [26] "progesterone_receptor_status"           
## [27] "her2_receptor_status"

Notice there are 27 column names for the 27 columns. By now, you should understand what some of the terms refer to. If not, loop back to Practice 3!

# We can save the feature names to a file for reference!
# `write.csv()` saves the output of our command to a csv file 
# in your working directory.

write.csv(colnames(brca_clin_df),"features.csv", row.names=FALSE)

Check your directory to see if the file is there. Click on the file name to view it. Try this again by removing the argument row.names=FALSE to see what happens.


Examining clinical features

Viewing the data frame gave us an idea of the types of information included, but how do we summarize the values for all 1,082 patients for a given feature, such as gender, vital_status, or estrogen_receptor_status?

Gender

It looks like all of the patients are “FEMALE”, but is this true?

The table() function is VERY useful. It summarizes the data for us. gender is the second column of our data frame.

# When an index value is empty, all values are included
# So we want all rows in our summary table
# 2 refers to the second column
table(brca_clin_df[ , 2])
## 
## FEMALE   MALE 
##   1070     12

What if we don’t know what column gender refers to? Or we don’t feel like counting, especially if we have a large data frame? We know the name of the column, and we can use this information directly.

# The $ operator can be used to select a column with its name
# We can get the column for gender information with `$gender`.
table(brca_clin_df$gender)
## 
## FEMALE   MALE 
##   1070     12

There are a dozen male patients in this cohort.


Age at Diagnosis

The function hist() creates a histogram of numerical data.

A histogram is a diagram with rectangles whose area is proportional to the frequency of a feature and whose width is equal to the feature interval.

Let’s see what values are in the feature age_at_diagnosis.

# The $ operator can be used to select a column with its name.
hist(brca_clin_df$age_at_diagnosis)

For example, there are almost 150 patients between the ages 45 and 50.

As with many measurements in nature, the variable age_at_diagnosis follows a normal distribution (i.e. it is bell curve). Here are some examples.

We should label our axes and add a title that make our plot easier to interpret!

# The $ operator can be used to select a column with its name.
# `xlab` and `main` are arguments to `hist()`.
hist(brca_clin_df$age_at_diagnosis,
     xlab="Age at Diagnosis",      #label the x-axis
     main="Distribution of Ages")  #Add a title

Remember, we can always use the help() function to learn more about what a function does and what arguments it takes.

# Get the help page for the function `hist()`
help(hist)

Estrogen Receptor status

Breast cancer cells are tested to see if they have estrogen receptors. When the hormone estrogen attaches to its receptors, it stimulates the cancer to grow. Cancers are called estrogen receptor-positive (ER+) or estrogen receptor-negative (ER-) based on whether or not they have these receptors.

Knowing the estrogen receptor status is important in deciding among treatment options.

Many breast cancer tumors are estrogen receptor (ER) positive. How many ER+ and ER- tumors are in the TCGA breast cancer cohort?

In the previous section, we saw there is a feature called estrogen_receptor_status.

# The $ operator can be used to select a column with its name.
table(brca_clin_df$estrogen_receptor_status)
## 
## [Not Evaluated]   Indeterminate        Negative        Positive 
##              48               2             236             796

Some samples do not have information on receptor status, either “[Not Evaluated]” or “Indeterminate”.

It is usually the case with real-life data that not all records (here, patients) have values for all features.

So when working with complex data sets like this, we have to be aware that there may be missing values.


Progesterone Receptor status

Like estrogen, progesterone is a hormone that stimulates breast cancer cells to grow.

Let’s look at the status of the progesterone receptor negative (PR). There’s a data frame column called progesterone_receptor_status.

# The $ operator can be used to select a column with its name.
table(brca_clin_df$progesterone_receptor_status)
## 
## [Not Evaluated]   Indeterminate        Negative        Positive 
##              49               4             340             689

According to these summaries, there are 796 ER+ samples and 689 PR+ samples.

How many samples have positive status for both estrogen and progesterone receptors? In other words, how many samples have both ER+ and PR+?

We want to find the patients (rows in brca_clin_df) that have

brca_clin_df$estrogen_receptor_status == “Positive”

AND

brca_clin_df$progesterone_receptor_status == “Positive”.

# The operator == finds instances that contain the value "Positive"
# We use the & operator to apply both conditions.
# Save the results to a variable called `ERpos_PR_pos` 
ERpos_PRpos <- brca_clin_df[(brca_clin_df$estrogen_receptor_status == "Positive" 
               &
               brca_clin_df$progesterone_receptor_status == "Positive"),  ]

# How many patients (rows) have both ER+ and PR+?
nrow(ERpos_PRpos)
## [1] 672

The overlap between the 796 ER+ tumors and 689 PR+ tumors is 672.

Many breast cancer tumors are both ER+ and PR+.

We can make this output more readable for our report. Remember, we already saved this data frame result to the object ERpos_PRpos.

# The function paste() allows us to print text and numbers together.

print(paste("The number of samples with both ER+ and PR+ is", nrow(ERpos_PRpos)))
## [1] "The number of samples with both ER+ and PR+ is 672"

HER2 status

HER2 is the gene name for the Receptor tyrosine-protein kinase erbB-2; it helps control cell growth.

There is a data feature called “her2_receptor_status.”
(Remember, you can see this by using the colnames() function as we did above.)

# The $ operator can be used to select a column with its name.
table(brca_clin_df$her2_receptor_status)
## 
## [Not Available] [Not Evaluated]       Equivocal   Indeterminate        Negative 
##               8             170             177              12             554 
##        Positive 
##             161

In addition to the missing values we found for ER and PR ([Not Available] and Indeterminate), HER2 status can also be “Equivocal” and “Indeterminate.”


Triple Negative Breast Cancer

Triple negative breast cancer samples are

  • estrogen receptor negative (ER-)
  • progesterone receptor negative (PR-)
  • HER2 negative (HER2-).

Triple negative breast cancer is agressive and difficult to treat.

How many samples have “Negative” status in all three cases?

The pieces of information we need are in the columns called

  • estrogen_receptor_status
  • progesterone_receptor_status
  • her2_receptor_status

Let’s make a data frame receptor_status that contains the sample names (bcr_patient_barcode) and the three receptor status features. We can do this with indexing or by pulling columns out of the data frame by column name.

This code chunk does a lot, so take your time with it. And we learn a new function cbind().

# The $ operator can be used to select a column with its name.
sample_id <- brca_clin_df$bcr_patient_barcode
er_status <- brca_clin_df$estrogen_receptor_status
pr_status <- brca_clin_df$progesterone_receptor_status
her2_status <- brca_clin_df$her2_receptor_status

# `cbind()` takes each vector as a column in a new data frame
receptor_status<- cbind(sample_id,er_status,pr_status,her2_status)

The data frame receptor_status contains a subset of the full data frame brca_clin_df.

# We can perform operations on the new dataframe
dim(receptor_status)
## [1] 1082    4
head(receptor_status)
##      sample_id      er_status  pr_status  her2_status    
## [1,] "TCGA-3C-AAAU" "Positive" "Positive" "Negative"     
## [2,] "TCGA-3C-AALI" "Positive" "Positive" "Positive"     
## [3,] "TCGA-3C-AALJ" "Positive" "Positive" "Indeterminate"
## [4,] "TCGA-3C-AALK" "Positive" "Positive" "Positive"     
## [5,] "TCGA-4H-AAAK" "Positive" "Positive" "Equivocal"    
## [6,] "TCGA-5L-AAT0" "Positive" "Positive" "Negative"

From receptor_status, we’ll create another data frame tnbc (for triple negative breast cancer) that contains only those samples that have “Negative” status for all three receptors.

Again, we use the logical operators == and &.

tnbc <- receptor_status[(receptor_status[,2]=="Negative"&
                        receptor_status[,3]=="Negative"&
                        receptor_status[,4]=="Negative"),]

# Let's check out the new dataframe tnbc
head(tnbc)
##      sample_id      er_status  pr_status  her2_status
## [1,] "TCGA-A1-A0SK" "Negative" "Negative" "Negative" 
## [2,] "TCGA-A1-A0SP" "Negative" "Negative" "Negative" 
## [3,] "TCGA-A2-A04U" "Negative" "Negative" "Negative" 
## [4,] "TCGA-A2-A0CM" "Negative" "Negative" "Negative" 
## [5,] "TCGA-A2-A0D0" "Negative" "Negative" "Negative" 
## [6,] "TCGA-A2-A0D2" "Negative" "Negative" "Negative"
nrow(tnbc)
## [1] 114

So overall, 114 samples out of 1082 total samples correspond to Triple Negative Breast Cancer, the most aggressive type of primary breast cancer.

Histology

Create a table for “histological_type”.

table(brca_clin_df$histological_type)
## 
##                  [Not Available]       Infiltrating Carcinoma NOS 
##                                1                                1 
##    Infiltrating Ductal Carcinoma   Infiltrating Lobular Carcinoma 
##                              774                              201 
##              Medullary Carcinoma            Metaplastic Carcinoma 
##                                6                                8 
## Mixed Histology (please specify)               Mucinous Carcinoma 
##                               29                               17 
##                   Other, specify 
##                               45

Most of the breast cancer samples are either “Infiltrating Ductal Carcinoma” (774 samples) or “Infiltrating Lobular Carcinoma” (201 samples), the two most common types of breast cancer. Refer back to the figure “Microscopy Images of Ducts.”

We can look for a specific instance of an histological type by indexing on type name.

# Save this number to a new object
N_ductal <- table(brca_clin_df$histological_type)["Infiltrating Ductal Carcinoma"]

print(N_ductal)
## Infiltrating Ductal Carcinoma 
##                           774

DREAM-High Challenge 1

Stage II describes invasive cancer that is contained in the breast and/or the growth has extended to nearby lymph nodes.

Read more about breast cancer staging here.

How many patients have stage II cancer?

# First find out what the values are for `pathologic_stage`
table(brca_clin_df$pathologic_stage)
## 
## [Not vailable]              I             II            III             IV 
##              5            179            615            250             19 
##              X 
##             14

X denotes indeterminate stage.

# Uncomment the code lines below, and replace xxxxxx with the correct feature name and yyyyyy with the data fram name for stage 2

# N_stage2 <- table(brca_clin_df$xxxxxx)["YY"]

# print(N_stage2)

How would you report your result? Use the paste() function, and a combination of text and variables to describe what you found.

DREAM-High Challenge 2

What is the tumor_status across patients?

table(brca_clin_df$tumor_status)
## 
## TUMOR FREE WITH TUMOR 
##        923        123

How many samples are “WITH TUMOR”?

# uncomment the code and assign the correct value to `N_tumor`
# Refer to the examples above, which are similar.
# N_tumor <- 

How many patients with tumor_status = “WITH TUMOR” also have pathologic.stage = “II”?

First, create a data frame with the samples that meet these criteria.

# Uncomment the next three lines of code and replace YY and XXXXX with the correct values

# tumor_II<-brca_clin_df[(brca_clin_df$pathologic_stage=="YY"
#                 &
#                  brca_clin_df$tumor_status=="XXXXX"),]

Then, calculate the size of this cohort.

# Choose a function ffff to determine how many rows (patients) meet this condition

# ffff(tumor_II)

Assign your result.

# Uncomment and assign the value you obtain

# N_tumor_II <-

Finally, report your result using the paste() function and a combination of text and variables.

You will have found that about half of the tumors in this cohort are stage II.

Stage II breast cancer is usually treated by surgical removal followed by radiation.

Wrapping up

Great work!!.

We have learned a lot about breast cancer by examining data systematically collected from patients. It is great that these patients gave their consent to contribute to TCGA. The data is anonymized so that it cannot be traced back to any particular person. TCGA data is publicly available and used by researchers worldwide to improve our ability to diagnose, treat, and prevent cancer.

You can learn more about TCGA here.

In future sessions, we will be introduced to the gene expression data for these patients.