In this activity, I learned new skills in R with a large real-life dataset:

I will load and examine a data frame that contains clinical information from over 1,000 breast cancer patients from The Cancer Genome Atlas (TCGA).

TCGA characterized over 20,000 cancer samples spanning 33 cancer types with genomics.

Genomics is a field of biology that studies the entire set of DNA (the genome) in an organism (such as humans!), including all its genes and how they interact with each other and the environment. Learn more here.

In the next few activities, I will examine some of the different data types and the computational analyses that were performed to decipher breast cancer data from TCGA.


Preliminaries

Preliminaries are the tasks that I needed to do before I began my analysis.

Load packages

Many users and developers contribute packages to R that contain useful functionality not available with the default R distribution. At the beginning of my activities, I load the special packages that I need with the library() function.


The data directory

All the data - including datasets and images - is located in the directory /home/data/, and I will include this in the *path pointing to my files.


Breast Cancer Basics

Breast cancer most often begins in the ducts and the lobules, the glands that create milk.

In breast cancer, the cells lining the ducts and lobules grow and divide abnormally.

# Images are matrices or arrays of pixel values.
duct_cells_img<-load.image("/home/data/duct_cells.jpg")

# The function `par()` defines parameters for plotting
#par(font.sub=4, mar=c(0,0,0,0))

plot(duct_cells_img,axes=FALSE)

I will be examining patient data for different types of invasive (or infiltrating) cancers. Invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) are the two most common types of invasive breast cancer.


Metastasis involves the development of secondary malignant growths at a distance from a primary site of cancer.

# Images are matrices or arrays of pixel values
meta_img<-load.image("/home/data/metastasis.jpg")

# The function `par()` defines parameters for plotting
par(font.sub=4, mar=c(0,0,0,0))

plot(meta_img,axes=FALSE)


Loading data

Throughout the course, I will be loading data from R files with the extension .RData. An example is the file brca_clin.RData which I will work with in this activity.

.RData files are binary files (i.e., not human readable) that hold R objects, such as data frames.

# The `load()` function reads in the data file "/home/data/brca_clin.RData".
# The argument "verbose = TRUE" makes `load()` tells me what R objects are being loaded.
# The objects will also appear in my "Environment" tab.

load("/home/data/brca_clin.RData",verbose=TRUE)   
## Loading objects:
##   brca_clin_df

The object brca_clin_df was named to be be descriptive.

Data frames are data displayed as a table.

In data science, columns in a data frame are features and rows are records or observations.


Viewing the data

The Environment tab gives me basic information on the objects that were loaded.

There are 1082 observations (patients) with 27 variables (clinical features).

I can use some simple R functions to examine the data in more detail.

# How many rows and columns are in my object?
dim(brca_clin_df)
## [1] 1082   27

Check the Environments tab to check if the output is what was expected.

# Take a first look at the data in my object
# `head` gives me the first six rows and ALL of the columns
head(brca_clin_df)

Scroll through the columns by clicking on the arrow in the top right corner of the table. I can see the entire data frame more conveniently in a new tab:

# Get a global view of the data
View(brca_clin_df)

Here are a few definitions of the column names:


Indexing

I can use indexing to view specific elements of the data frame:

# Look at a single element
# The first number denotes row, the second number denotes column
# The first feature for the patient in row 1
brca_clin_df[1,1]
## [1] "TCGA-3C-AAAU"

The function c() combines or concatenates objects into a single list.

# The function`c()` combines values into a list
# The colon operator `:` creates a sequence of numbers
c(1:3,8,15)
## [1]  1  2  3  8 15

Let’s look at the last three features for a random group of patients.

brca_clin_df[c(1:3,8,15),25:27]
##    estrogen_receptor_status progesterone_receptor_status her2_receptor_status
## 1                  Positive                     Positive             Negative
## 2                  Positive                     Positive             Positive
## 3                  Positive                     Positive        Indeterminate
## 8                  Positive                     Negative             Negative
## 15                 Positive                     Positive            Equivocal

Practice

Here are some exercises that I completed to get used to extracting rows and columns from the tables, I experimented by changing the values for indexing.

Practice 1: Indexing

Change the index numbers in the code chunks above to explore the data frame.

Practice 2: Data for a single patient

# Look at a single row or patient
# `t()` transposes the data - switches the row to a column 
# I used it here to make the result easier to read
t(brca_clin_df[1,])
##                                         1                                                                                             
## bcr_patient_barcode                     "TCGA-3C-AAAU"                                                                                
## gender                                  "FEMALE"                                                                                      
## race                                    "WHITE"                                                                                       
## ethnicity                               "NOT HISPANIC OR LATINO"                                                                      
## age_at_diagnosis                        "55"                                                                                          
## year_of_initial_pathologic_diagnosis    "2004"                                                                                        
## vital_status                            "Alive"                                                                                       
## menopause_status                        "Pre (<6 months since LMP AND no prior bilateral ovariectomy AND not on estrogen replacement)"
## tumor_status                            "WITH TUMOR"                                                                                  
## margin_status                           "Negative"                                                                                    
## days_to_last_followup                   "4047"                                                                                        
## prior_dx                                "No"                                                                                          
## new_tumor_event_after_initial_treatment "NO"                                                                                          
## radiation_therapy                       "NO"                                                                                          
## histological_type                       "Infiltrating Lobular Carcinoma"                                                              
## pathologic_T                            "TX"                                                                                          
## pathologic_M                            "MX"                                                                                          
## pathologic_N                            "NX"                                                                                          
## pathologic_stage_sub                    "Stage X"                                                                                     
## pathologic_stage                        "X"                                                                                           
## lymph_node_examined_count               "13"                                                                                          
## number_of_lymphnodes_positive           "4"                                                                                           
## initial_diagnosis_method                "[Not Available]"                                                                             
## surgical_procedure                      "Modified Radical Mastectomy"                                                                 
## estrogen_receptor_status                "Positive"                                                                                    
## progesterone_receptor_status            "Positive"                                                                                    
## her2_receptor_status                    "Negative"
# I noticied that the output data are identified with the column names (features) of my data frame. To see what some of the features correspond to, I simply Googled some of the terms, e.g. "pathologic stage" and "initial diagnosis method" to learn more about breast cancer.

Practice 3: Sample the data

I can examine data for multiple patients and features using the the combine function c() and the colon operator.

# Look at a few features for several patients
brca_clin_df[c(20,30,40),1:5]
##    bcr_patient_barcode gender            race              ethnicity
## 20        TCGA-A1-A0SP FEMALE [Not Available] NOT HISPANIC OR LATINO
## 30        TCGA-A2-A04X FEMALE           WHITE        [Not Available]
## 40        TCGA-A2-A0CT FEMALE           WHITE NOT HISPANIC OR LATINO
##    age_at_diagnosis
## 20               40
## 30               34
## 40               71

Practice 4: More meaningful output

Readability: I can combine functions with text to get more informative output.

The dim() function provides two numbers:

  • The number of rows
  • The number of columns
# Assign the dimensions of the data frame to an object.
dim_df <- dim(brca_clin_df)  

print(dim_df)
## [1] 1082   27

How do I get just the number of rows?

# dim_df contains two elements. I can get the first and the second
# with indexing
dim_df[1]
## [1] 1082
dim_df[2]
## [1] 27
# Since `dim_df` has two values, I can choose each separately with indexing.
# The function paste() allows me to print text and numbers together.

print(paste("The clinical data frame has",dim_df[1],
            "rows and",dim_df[2],"columns."))
## [1] "The clinical data frame has 1082 rows and 27 columns."

Practice 5: Writing files

To find all the features in the data frame I can collect the column names.

# Many function names in R are intuitive.
# colnames gives the names of the columns.

colnames(brca_clin_df)
##  [1] "bcr_patient_barcode"                    
##  [2] "gender"                                 
##  [3] "race"                                   
##  [4] "ethnicity"                              
##  [5] "age_at_diagnosis"                       
##  [6] "year_of_initial_pathologic_diagnosis"   
##  [7] "vital_status"                           
##  [8] "menopause_status"                       
##  [9] "tumor_status"                           
## [10] "margin_status"                          
## [11] "days_to_last_followup"                  
## [12] "prior_dx"                               
## [13] "new_tumor_event_after_initial_treatment"
## [14] "radiation_therapy"                      
## [15] "histological_type"                      
## [16] "pathologic_T"                           
## [17] "pathologic_M"                           
## [18] "pathologic_N"                           
## [19] "pathologic_stage_sub"                   
## [20] "pathologic_stage"                       
## [21] "lymph_node_examined_count"              
## [22] "number_of_lymphnodes_positive"          
## [23] "initial_diagnosis_method"               
## [24] "surgical_procedure"                     
## [25] "estrogen_receptor_status"               
## [26] "progesterone_receptor_status"           
## [27] "her2_receptor_status"

There are 27 column names for the 27 columns. By now, I understand what some of the terms refer to. But if I forget, I can loop back to Practice 2.

# I can save the feature names to a file for reference!
# `write.csv()` saves the output of my command to a csv file 
# in my working directory.

write.csv(colnames(brca_clin_df),"features.csv", row.names=FALSE)

# I checked my directory to see if the file is there by clicking on the file name to view it. I tried this again by removing the argument `row.names=FALSE` to see what happens.

Age at Diagnosis

The function hist() creates a histogram for numerical data. I want to use this for the features that are continuous rather than categorical.

A histogram is a diagram with rectangles whose area is proportional to the frequency of a feature and whose width is equal to the feature interval.

Let’s see what values are in the feature age_at_diagnosis.

# The $ operator can be used to select a column with its name.
hist(brca_clin_df$age_at_diagnosis)

For example, there are almost 150 patients between the ages 45 and 50.

As with many measurements in nature, the variable age_at_diagnosis follows a normal distribution (i.e. it is bell curve). Here are some examples.

I will label my axes and add a title that make my plot easier to interpret!

# The $ operator can be used to select a column with its name.
# `xlab` and `main` are arguments to `hist()`.
hist(brca_clin_df$age_at_diagnosis,
     xlab="Age at Diagnosis",      #label the x-axis
     main="Distribution of Ages")  #Add a title

# I can always use the `help()` function to learn more about what a function does and what arguments it takes.
# Get the help page for the function `hist()`
help(hist)

An aside: The Central Limit Theorem

The Central Limit Theorem is a fundamental concept in statistics: For a variable that is influenced by many small, independent factors, the overall distribution of that variable will tend towards a normal distribution.

Many (not all!) natural phenomena can be seen as the result of numerous, small, and relatively independent contributing factors. For example, human height is influenced by genetics, nutrition, environment, and other factors. Each of these factors contributes a small amount to the overall height, and because they are largely independent, the resulting distribution tends to be approximately normal.

Often natural data are skewed where the values cluster more on one side of the distribution, creating a “tail” on either the left or the right.

Progesterone Receptor status

Like estrogen, progesterone is a hormone that stimulates breast cancer cells to grow.

I am going to look at the status of the progesterone receptor negative (PR). There’s a data frame column called progesterone_receptor_status.

# The $ operator can be used to select a column with its name.
table(brca_clin_df$progesterone_receptor_status)
## 
## [Not Evaluated]   Indeterminate        Negative        Positive 
##              49               4             340             689

According to these summaries, there are 796 ER+ samples and 689 PR+ samples.


ER+/PR+ samples

How many samples have positive status for both estrogen and progesterone receptors? In other words, how many samples have both ER+ and PR+?

I want to find the patients (rows in brca_clin_df) that have

brca_clin_df$estrogen_receptor_status == “Positive”

AND

brca_clin_df$progesterone_receptor_status == “Positive”.

# The operator == finds instances that contain the value "Positive"
# I use the & operator to apply both conditions.
# Save the results to an object called `ERpos_PR_pos` 
ERpos_PRpos <- brca_clin_df[(brca_clin_df$estrogen_receptor_status == "Positive" 
               &
               brca_clin_df$progesterone_receptor_status == "Positive"),  ]

# How many patients (rows) have both ER+ and PR+?
nrow(ERpos_PRpos)
## [1] 672

The overlap between the 796 ER+ tumors and 689 PR+ tumors is 672.

Many breast cancer tumors are both ER+ and PR+.

I can make this output more readable for my report. Remember, I already saved this data frame result to the object ERpos_PRpos.

# The function paste() allows me to print text and numbers together.

print(paste("The number of samples with both ER+ and PR+ is", nrow(ERpos_PRpos)))
## [1] "The number of samples with both ER+ and PR+ is 672"

HER2 status

HER2 is the gene name for the Receptor tyrosine-protein kinase erbB-2; it helps control cell growth.

There is a data feature called “her2_receptor_status.”
(Remember, I can see this by using the colnames() function as I did above, or I can look at the brca_clin_df table I created with the View() function.)

# The $ operator can be used to select a column with its name.
table(brca_clin_df$her2_receptor_status)
## 
## [Not Available] [Not Evaluated]       Equivocal   Indeterminate        Negative 
##               8             170             177              12             554 
##        Positive 
##             161

In addition to the missing values I found for ER and PR ([Not Evaluated] and Indeterminate), HER2 status can also be [Not Available] and “Equivocal.”


Triple Negative Breast Cancer

Triple negative breast cancer samples have “Negative” values for

Triple negative breast cancer is agressive and difficult to treat.

How many samples have “Negative” status in all three cases?

The pieces of information I need are in the columns called

Let’s make a data frame tnbc that contains information about the patients with triple negative breast cancer.

# The operator == finds instances that contain the value "Positive"
# I can use the & operator to apply both conditions.
# Save the results to an object called `ERpos_PR_pos` 
tnbc <- brca_clin_df[(brca_clin_df$estrogen_receptor_status == "Negative" 
                      &
                      brca_clin_df$progesterone_receptor_status == "Negative"
                      &
                      brca_clin_df$her2_receptor_status == "Negative"),]

# How many patients (rows) have both ER+ and PR+?
nrow(tnbc)
## [1] 114

The data frame tnbc contains a subset of the full data frame brca_clin_df. I can examine the data and perform summary analyses with the table() and hist() functions.

# Examine features for patients with triple negative breast cancer
View(tnbc)
# What is the gender distribution for triple negative breast cancer
table(tnbc$gender)
## 
## FEMALE 
##    114

So overall, 114 samples out of 1082 total samples correspond to Triple Negative Breast Cancer, the most aggressive type of primary breast cancer, and all of the samples are from FEMALE patients.


Breast cancer histology

Create a table for “histological_type”.

table(brca_clin_df$histological_type)
## 
##                  [Not Available]       Infiltrating Carcinoma NOS 
##                                1                                1 
##    Infiltrating Ductal Carcinoma   Infiltrating Lobular Carcinoma 
##                              774                              201 
##              Medullary Carcinoma            Metaplastic Carcinoma 
##                                6                                8 
## Mixed Histology (please specify)               Mucinous Carcinoma 
##                               29                               17 
##                   Other, specify 
##                               45

Most of the breast cancer samples are either “Infiltrating Ductal Carcinoma” (774 samples) or “Infiltrating Lobular Carcinoma” (201 samples), the two most common types of breast cancer. Infilatrating corresponds to invasive. Refer back to the figure “Microscopy Images of Ducts.”

I can look for a specific instance of an histological type by indexing on type name, find its frequency, and the percentage of cases it represents.

# Save the number IDC samples to a new object
N_ductal <- table(brca_clin_df$histological_type)["Infiltrating Ductal Carcinoma"]

percent_ductal <- round( N_ductal/1082 * 100 , 0)

print(paste(percent_ductal,"% of breast cancer patients have Infiltrating Ductal Carcinoma", sep=""))
## [1] "72% of breast cancer patients have Infiltrating Ductal Carcinoma"

DREAM-High Challenge: Pathologic Stage

I am going to try to find out how many patients have stage II cancer.

Stage II describes invasive cancer that is contained in the breast and/or the growth has extended to nearby lymph nodes.

Read more about breast cancer staging here.

# First find out what the values are for `pathologic_stage`
 table(brca_clin_df$pathologic_stage)
## 
## [Not vailable]              I             II            III             IV 
##              5            179            615            250             19 
##              X 
##             14

X denotes indeterminate stage.

# How many patients have stage II cancer?

 N_stageII <- table(brca_clin_df$pathologic_stage)["II"]

print(N_stageII)
##  II 
## 615
# I need to report my result by using the `paste()` function to describe what I found.
print(paste("The number of patients with Stage II breast cancer:",N_stageII))
## [1] "The number of patients with Stage II breast cancer: 615"

What percentage of patients have stage II breast cancer?

# Find code above to calculate the percentage of stage II samples
# from N_stageII and the total number of patients.
# Print results.

percent_stageII <- round( N_stageII/1082 * 100 , 0)

print (paste("The percentage of patients with Stage II breast cancer: ",percent_stageII,"%", sep=""))
## [1] "The percentage of patients with Stage II breast cancer: 57%"

Wrapping up

I have learned a lot about breast cancer by examining data systematically collected from patients. It is great that these patients gave their consent to contribute to TCGA. The data is anonymized so that it cannot be traced back to any particular person. TCGA data is publicly available and used by researchers worldwide to improve our ability to diagnose, treat, and prevent cancer.

We can learn more about TCGA here.