In this activity, I learned new skills in R with a large real-life dataset:
I will load and examine a data frame that contains clinical information from over 1,000 breast cancer patients from The Cancer Genome Atlas (TCGA).
TCGA characterized over 20,000 cancer samples spanning 33 cancer types with genomics.
Genomics is a field of biology that studies the entire set of DNA (the genome) in an organism (such as humans!), including all its genes and how they interact with each other and the environment. Learn more here.
In the next few activities, I will examine some of the different data types and the computational analyses that were performed to decipher breast cancer data from TCGA.
Preliminaries are the tasks that I needed to do before I began my analysis.
Many users and developers contribute packages to R that contain useful functionality not available with the default R distribution. At the beginning of my activities, I load the special packages that I need with the library()
function.
All the data - including datasets and images - is located in the directory /home/data/
, and I will include this in the *path pointing to my files.
Breast cancer most often begins in the ducts and the lobules, the glands that create milk.
In breast cancer, the cells lining the ducts and lobules grow and divide abnormally.
# Images are matrices or arrays of pixel values.
duct_cells_img<-load.image("/home/data/duct_cells.jpg")
# The function `par()` defines parameters for plotting
#par(font.sub=4, mar=c(0,0,0,0))
plot(duct_cells_img,axes=FALSE)
I will be examining patient data for different types of invasive (or infiltrating) cancers. Invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) are the two most common types of invasive breast cancer.
Metastasis involves the development of secondary malignant growths at a distance from a primary site of cancer.
# Images are matrices or arrays of pixel values
meta_img<-load.image("/home/data/metastasis.jpg")
# The function `par()` defines parameters for plotting
par(font.sub=4, mar=c(0,0,0,0))
plot(meta_img,axes=FALSE)
Throughout the course, I will be loading data from R files with the extension .RData
. An example is the file brca_clin.RData
which I will work with in this activity.
.RData
files are binary files (i.e., not human readable) that hold R objects, such as data frames.
# The `load()` function reads in the data file "/home/data/brca_clin.RData".
# The argument "verbose = TRUE" makes `load()` tells me what R objects are being loaded.
# The objects will also appear in my "Environment" tab.
load("/home/data/brca_clin.RData",verbose=TRUE)
## Loading objects:
## brca_clin_df
The object brca_clin_df
was named to be be descriptive.
Data frames are data displayed as a table.
In data science, columns in a data frame are features and rows are records or observations.
The Environment tab gives me basic information on the objects that were loaded.
There are 1082 observations (patients) with 27 variables (clinical features).
I can use some simple R functions to examine the data in more detail.
dim()
tells me the dimensions (# rows, # columns) of the object.head()
returns the first several rows of the objects.# How many rows and columns are in my object?
dim(brca_clin_df)
## [1] 1082 27
Check the Environments tab to check if the output is what was expected.
# Take a first look at the data in my object
# `head` gives me the first six rows and ALL of the columns
head(brca_clin_df)
Scroll through the columns by clicking on the arrow in the top right corner of the table. I can see the entire data frame more conveniently in a new tab:
# Get a global view of the data
View(brca_clin_df)
Here are a few definitions of the column names:
bcr_patient_barcode
provides anonymous identifiers for the patients.histological_type
describes the ductal invasion.estrogen_receptor_status
designates whether the tumor is ER+ or ER-.I can use indexing to view specific elements of the data frame:
# Look at a single element
# The first number denotes row, the second number denotes column
# The first feature for the patient in row 1
brca_clin_df[1,1]
## [1] "TCGA-3C-AAAU"
The function c()
combines or concatenates objects into a single list.
# The function`c()` combines values into a list
# The colon operator `:` creates a sequence of numbers
c(1:3,8,15)
## [1] 1 2 3 8 15
Let’s look at the last three features for a random group of patients.
brca_clin_df[c(1:3,8,15),25:27]
## estrogen_receptor_status progesterone_receptor_status her2_receptor_status
## 1 Positive Positive Negative
## 2 Positive Positive Positive
## 3 Positive Positive Indeterminate
## 8 Positive Negative Negative
## 15 Positive Positive Equivocal
Here are some exercises that I completed to get used to extracting rows and columns from the tables, I experimented by changing the values for indexing.
Change the index numbers in the code chunks above to explore the data frame.
# Look at a single row or patient
# `t()` transposes the data - switches the row to a column
# I used it here to make the result easier to read
t(brca_clin_df[1,])
## 1
## bcr_patient_barcode "TCGA-3C-AAAU"
## gender "FEMALE"
## race "WHITE"
## ethnicity "NOT HISPANIC OR LATINO"
## age_at_diagnosis "55"
## year_of_initial_pathologic_diagnosis "2004"
## vital_status "Alive"
## menopause_status "Pre (<6 months since LMP AND no prior bilateral ovariectomy AND not on estrogen replacement)"
## tumor_status "WITH TUMOR"
## margin_status "Negative"
## days_to_last_followup "4047"
## prior_dx "No"
## new_tumor_event_after_initial_treatment "NO"
## radiation_therapy "NO"
## histological_type "Infiltrating Lobular Carcinoma"
## pathologic_T "TX"
## pathologic_M "MX"
## pathologic_N "NX"
## pathologic_stage_sub "Stage X"
## pathologic_stage "X"
## lymph_node_examined_count "13"
## number_of_lymphnodes_positive "4"
## initial_diagnosis_method "[Not Available]"
## surgical_procedure "Modified Radical Mastectomy"
## estrogen_receptor_status "Positive"
## progesterone_receptor_status "Positive"
## her2_receptor_status "Negative"
# I noticied that the output data are identified with the column names (features) of my data frame. To see what some of the features correspond to, I simply Googled some of the terms, e.g. "pathologic stage" and "initial diagnosis method" to learn more about breast cancer.
I can examine data for multiple patients and features using the the combine function c()
and the colon operator.
# Look at a few features for several patients
brca_clin_df[c(20,30,40),1:5]
## bcr_patient_barcode gender race ethnicity
## 20 TCGA-A1-A0SP FEMALE [Not Available] NOT HISPANIC OR LATINO
## 30 TCGA-A2-A04X FEMALE WHITE [Not Available]
## 40 TCGA-A2-A0CT FEMALE WHITE NOT HISPANIC OR LATINO
## age_at_diagnosis
## 20 40
## 30 34
## 40 71
Readability: I can combine functions with text to get more informative output.
The dim()
function provides two numbers:
# Assign the dimensions of the data frame to an object.
dim_df <- dim(brca_clin_df)
print(dim_df)
## [1] 1082 27
How do I get just the number of rows?
# dim_df contains two elements. I can get the first and the second
# with indexing
dim_df[1]
## [1] 1082
dim_df[2]
## [1] 27
# Since `dim_df` has two values, I can choose each separately with indexing.
# The function paste() allows me to print text and numbers together.
print(paste("The clinical data frame has",dim_df[1],
"rows and",dim_df[2],"columns."))
## [1] "The clinical data frame has 1082 rows and 27 columns."
To find all the features in the data frame I can collect the column names.
# Many function names in R are intuitive.
# colnames gives the names of the columns.
colnames(brca_clin_df)
## [1] "bcr_patient_barcode"
## [2] "gender"
## [3] "race"
## [4] "ethnicity"
## [5] "age_at_diagnosis"
## [6] "year_of_initial_pathologic_diagnosis"
## [7] "vital_status"
## [8] "menopause_status"
## [9] "tumor_status"
## [10] "margin_status"
## [11] "days_to_last_followup"
## [12] "prior_dx"
## [13] "new_tumor_event_after_initial_treatment"
## [14] "radiation_therapy"
## [15] "histological_type"
## [16] "pathologic_T"
## [17] "pathologic_M"
## [18] "pathologic_N"
## [19] "pathologic_stage_sub"
## [20] "pathologic_stage"
## [21] "lymph_node_examined_count"
## [22] "number_of_lymphnodes_positive"
## [23] "initial_diagnosis_method"
## [24] "surgical_procedure"
## [25] "estrogen_receptor_status"
## [26] "progesterone_receptor_status"
## [27] "her2_receptor_status"
There are 27 column names for the 27 columns. By now, I understand what some of the terms refer to. But if I forget, I can loop back to Practice 2.
# I can save the feature names to a file for reference!
# `write.csv()` saves the output of my command to a csv file
# in my working directory.
write.csv(colnames(brca_clin_df),"features.csv", row.names=FALSE)
# I checked my directory to see if the file is there by clicking on the file name to view it. I tried this again by removing the argument `row.names=FALSE` to see what happens.
The function hist()
creates a histogram for numerical data. I want to use this for the features that are continuous rather than categorical.
A histogram is a diagram with rectangles whose area is proportional to the frequency of a feature and whose width is equal to the feature interval.
Let’s see what values are in the feature age_at_diagnosis
.
# The $ operator can be used to select a column with its name.
hist(brca_clin_df$age_at_diagnosis)
For example, there are almost 150 patients between the ages 45 and 50.
As with many measurements in nature, the variable age_at_diagnosis
follows a normal distribution (i.e. it is bell curve). Here are some examples.
I will label my axes and add a title that make my plot easier to interpret!
# The $ operator can be used to select a column with its name.
# `xlab` and `main` are arguments to `hist()`.
hist(brca_clin_df$age_at_diagnosis,
xlab="Age at Diagnosis", #label the x-axis
main="Distribution of Ages") #Add a title
# I can always use the `help()` function to learn more about what a function does and what arguments it takes.
# Get the help page for the function `hist()`
help(hist)
The Central Limit Theorem is a fundamental concept in statistics: For a variable that is influenced by many small, independent factors, the overall distribution of that variable will tend towards a normal distribution.
Many (not all!) natural phenomena can be seen as the result of numerous, small, and relatively independent contributing factors. For example, human height is influenced by genetics, nutrition, environment, and other factors. Each of these factors contributes a small amount to the overall height, and because they are largely independent, the resulting distribution tends to be approximately normal.
Often natural data are skewed where the values cluster more on one side of the distribution, creating a “tail” on either the left or the right.
Like estrogen, progesterone is a hormone that stimulates breast cancer cells to grow.
I am going to look at the status of the progesterone receptor negative (PR). There’s a data frame column called progesterone_receptor_status
.
# The $ operator can be used to select a column with its name.
table(brca_clin_df$progesterone_receptor_status)
##
## [Not Evaluated] Indeterminate Negative Positive
## 49 4 340 689
According to these summaries, there are 796 ER+ samples and 689 PR+ samples.
How many samples have positive status for both estrogen and progesterone receptors? In other words, how many samples have both ER+ and PR+?
I want to find the patients (rows in brca_clin_df
) that have
brca_clin_df$estrogen_receptor_status == “Positive”
AND
brca_clin_df$progesterone_receptor_status == “Positive”.
# The operator == finds instances that contain the value "Positive"
# I use the & operator to apply both conditions.
# Save the results to an object called `ERpos_PR_pos`
ERpos_PRpos <- brca_clin_df[(brca_clin_df$estrogen_receptor_status == "Positive"
&
brca_clin_df$progesterone_receptor_status == "Positive"), ]
# How many patients (rows) have both ER+ and PR+?
nrow(ERpos_PRpos)
## [1] 672
The overlap between the 796 ER+ tumors and 689 PR+ tumors is 672.
Many breast cancer tumors are both ER+ and PR+.
I can make this output more readable for my report. Remember, I already saved this data frame result to the object ERpos_PRpos
.
# The function paste() allows me to print text and numbers together.
print(paste("The number of samples with both ER+ and PR+ is", nrow(ERpos_PRpos)))
## [1] "The number of samples with both ER+ and PR+ is 672"
HER2 is the gene name for the Receptor tyrosine-protein kinase erbB-2; it helps control cell growth.
There is a data feature called “her2_receptor_status.”
(Remember, I can see this by using the colnames()
function as I did above, or I can look at the brca_clin_df table I created with the View()
function.)
# The $ operator can be used to select a column with its name.
table(brca_clin_df$her2_receptor_status)
##
## [Not Available] [Not Evaluated] Equivocal Indeterminate Negative
## 8 170 177 12 554
## Positive
## 161
In addition to the missing values I found for ER and PR ([Not Evaluated] and Indeterminate), HER2 status can also be [Not Available] and “Equivocal.”
Triple negative breast cancer samples have “Negative” values for
Triple negative breast cancer is agressive and difficult to treat.
How many samples have “Negative” status in all three cases?
The pieces of information I need are in the columns called
Let’s make a data frame tnbc
that contains information about the patients with triple negative breast cancer.
# The operator == finds instances that contain the value "Positive"
# I can use the & operator to apply both conditions.
# Save the results to an object called `ERpos_PR_pos`
tnbc <- brca_clin_df[(brca_clin_df$estrogen_receptor_status == "Negative"
&
brca_clin_df$progesterone_receptor_status == "Negative"
&
brca_clin_df$her2_receptor_status == "Negative"),]
# How many patients (rows) have both ER+ and PR+?
nrow(tnbc)
## [1] 114
The data frame tnbc
contains a subset of the full data frame brca_clin_df
. I can examine the data and perform summary analyses with the table()
and hist()
functions.
# Examine features for patients with triple negative breast cancer
View(tnbc)
# What is the gender distribution for triple negative breast cancer
table(tnbc$gender)
##
## FEMALE
## 114
So overall, 114 samples out of 1082 total samples correspond to Triple Negative Breast Cancer, the most aggressive type of primary breast cancer, and all of the samples are from FEMALE
patients.
Create a table for “histological_type”.
table(brca_clin_df$histological_type)
##
## [Not Available] Infiltrating Carcinoma NOS
## 1 1
## Infiltrating Ductal Carcinoma Infiltrating Lobular Carcinoma
## 774 201
## Medullary Carcinoma Metaplastic Carcinoma
## 6 8
## Mixed Histology (please specify) Mucinous Carcinoma
## 29 17
## Other, specify
## 45
Most of the breast cancer samples are either “Infiltrating Ductal Carcinoma” (774 samples) or “Infiltrating Lobular Carcinoma” (201 samples), the two most common types of breast cancer. Infilatrating corresponds to invasive. Refer back to the figure “Microscopy Images of Ducts.”
I can look for a specific instance of an histological type by indexing on type name, find its frequency, and the percentage of cases it represents.
# Save the number IDC samples to a new object
N_ductal <- table(brca_clin_df$histological_type)["Infiltrating Ductal Carcinoma"]
percent_ductal <- round( N_ductal/1082 * 100 , 0)
print(paste(percent_ductal,"% of breast cancer patients have Infiltrating Ductal Carcinoma", sep=""))
## [1] "72% of breast cancer patients have Infiltrating Ductal Carcinoma"
I am going to try to find out how many patients have stage II cancer.
Stage II describes invasive cancer that is contained in the breast and/or the growth has extended to nearby lymph nodes.
Read more about breast cancer staging here.
# First find out what the values are for `pathologic_stage`
table(brca_clin_df$pathologic_stage)
##
## [Not vailable] I II III IV
## 5 179 615 250 19
## X
## 14
X denotes indeterminate stage.
# How many patients have stage II cancer?
N_stageII <- table(brca_clin_df$pathologic_stage)["II"]
print(N_stageII)
## II
## 615
# I need to report my result by using the `paste()` function to describe what I found.
print(paste("The number of patients with Stage II breast cancer:",N_stageII))
## [1] "The number of patients with Stage II breast cancer: 615"
What percentage of patients have stage II breast cancer?
# Find code above to calculate the percentage of stage II samples
# from N_stageII and the total number of patients.
# Print results.
percent_stageII <- round( N_stageII/1082 * 100 , 0)
print (paste("The percentage of patients with Stage II breast cancer: ",percent_stageII,"%", sep=""))
## [1] "The percentage of patients with Stage II breast cancer: 57%"
I have learned a lot about breast cancer by examining data systematically collected from patients. It is great that these patients gave their consent to contribute to TCGA. The data is anonymized so that it cannot be traced back to any particular person. TCGA data is publicly available and used by researchers worldwide to improve our ability to diagnose, treat, and prevent cancer.