In this activity, we will learn new skills in R with a large real-life dataset!
We will load and examine a data frame that contains clinical information from over 1,000 breast cancer patients from The Cancer Genome Atlas (TCGA).
TCGA characterized over 20,000 cancer samples spanning 33 cancer types with genomics.
Genomics is a field of biology that studies the entire set of DNA (the genome) in an organism (such as humans!), including all its genes and how they interact with each other and the environment. Learn more here.
In the next few activities, we will examine some of the different data types and the computational analyses that were performed to decipher breast cancer data from TCGA.
Preliminaries are the tasks we need to do before we begin our analysis.
Go to the File
tab and use Save as...
to create your own copy of this activity. Why? Because we inevitably make mistakes, and it’s great to have the original as a reference or to start over!
Many users and developers contribute packages to R that contain useful functionality not available with the default R distribution. At the beginning of our activities, we load the special packages we need with the library()
function.
Click the green arrow in the top right corner of the code chunk to run the code
Our data - including datasets and images - is located in the directory /home/data/
, and we will include this in the *path pointing to our files.
Breast cancer most often begins in the ducts and the lobules, the glands that create milk.
In breast cancer, the cells lining the ducts and lobules grow and divide abnormally.
# Images are matrices or arrays of pixel values.
duct_cells_img<-load.image("/home/data/duct_cells.jpg")
# The function `par()` defines parameters for plotting
#par(font.sub=4, mar=c(0,0,0,0))
plot(duct_cells_img,axes=FALSE)
We will examine patient data for different types of invasive (or infiltrating) cancers. Invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) are the two most common types of invasive breast cancer.
Metastasis involves the development of secondary malignant growths at a distance from a primary site of cancer.
# Images are matrices or arrays of pixel values
meta_img<-load.image("/home/data/metastasis.jpg")
# The function `par()` defines parameters for plotting
par(font.sub=4, mar=c(0,0,0,0))
plot(meta_img,axes=FALSE)
Throughout the course, we will be loading data from R files with the extension .RData
. An example is the file brca_clin.RData
which we will work with in this activity.
.RData
files are binary files (i.e., not human readable) that hold R objects, such as data frames.
# The `load()` function reads in the data file "/home/data/brca_clin.RData".
# The argument "verbose = TRUE" makes `load()` tell us what R objects are being loaded.
# The objects will also appear in our "Environment" tab.
load("/home/data/brca_clin.RData",verbose=TRUE)
## Loading objects:
## brca_clin_df
The object brca_clin_df
was named to be be descriptive.
Data frames are data displayed as a table.
In data science, columns in a data frame are features and rows are records or observations.
The Environment tab gives us basic information on the objects that were loaded.
There are 1082 observations (patients) with 27 variables (clinical features).
We can use some simple R functions to examine the data in more detail.
dim()
tells us the dimensions (# rows, # columns) of the object.head()
returns the first several rows of the objects.# How many rows and columns are in our object?
dim(brca_clin_df)
## [1] 1082 27
Is the output what we expect? Check your Environment tab.
# Let's take a first look at the data in our object
# `head` gives us the first six rows and ALL of the columns
head(brca_clin_df)
Scroll through the columns by clicking on the arrow in the top right corner of the table.
We can see the entire data frame more conveniently in a new tab:
# Get a global view of the data
View(brca_clin_df)
Notice the column names. Let’s define a few:
bcr_patient_barcode
provides anonymous identifiers for the patients.histological_type
describes the ductal invasion.estrogen_receptor_status
designates whether the tumor is ER+ or ER-.We can use indexing to view specific elements of the data frame:
# Look at a single element
# The first number denotes row, the second number denotes column
# The first feature for the patient in row 1
brca_clin_df[1,1]
## [1] "TCGA-3C-AAAU"
The function c()
combines or concatenates objects into a single list.
# The function`c()` combines values into a list
# The colon operator `:` creates a sequence of numbers
c(2:5,8,14)
## [1] 2 3 4 5 8 14
Let’s look at the last three features for a random group of patients.
brca_clin_df[c(2:5,8,14),25:27]
## estrogen_receptor_status progesterone_receptor_status her2_receptor_status
## 2 Positive Positive Positive
## 3 Positive Positive Indeterminate
## 4 Positive Positive Positive
## 5 Positive Positive Equivocal
## 8 Positive Negative Negative
## 14 Positive Positive Negative
Work through the following five exercises to get used to extracting rows and columns from tables. Experiment by changing the values for indexing.
Change the index numbers in the code chunks above to explore the data frame.
# Look at a single row or patient
# `t()` transposes the data - switches the row to a column
# We use it here to make the result easier to read
t(brca_clin_df[2,])
## 2
## bcr_patient_barcode "TCGA-3C-AALI"
## gender "FEMALE"
## race "BLACK OR AFRICAN AMERICAN"
## ethnicity "NOT HISPANIC OR LATINO"
## age_at_diagnosis "50"
## year_of_initial_pathologic_diagnosis "2003"
## vital_status "Alive"
## menopause_status "Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)"
## tumor_status "TUMOR FREE"
## margin_status "Negative"
## days_to_last_followup "4005"
## prior_dx "No"
## new_tumor_event_after_initial_treatment "NO"
## radiation_therapy "YES"
## histological_type "Infiltrating Ductal Carcinoma"
## pathologic_T "T2"
## pathologic_M "M0"
## pathologic_N "N1a"
## pathologic_stage_sub "Stage IIB"
## pathologic_stage "II"
## lymph_node_examined_count "15"
## number_of_lymphnodes_positive "1"
## initial_diagnosis_method "Core needle biopsy"
## surgical_procedure "Lumpectomy"
## estrogen_receptor_status "Positive"
## progesterone_receptor_status "Positive"
## her2_receptor_status "Positive"
Note that the output data are identified with the column names (features) of our data frame. Can you guess what some of the features correspond to?
Google some of the terms, e.g. “pathologic stage” and “initial diagnosis method.” This is a great way to learn more about breast cancer.
We can examine data for multiple patients and features using the the combine function c()
and the colon operator.
Can you predict what the output will be?
# Look at a few features for several patients
brca_clin_df[c(15,30,35),1:5]
## bcr_patient_barcode gender race ethnicity
## 15 TCGA-A1-A0SJ FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 30 TCGA-A2-A04X FEMALE WHITE [Not Available]
## 35 TCGA-A2-A0CO FEMALE WHITE NOT HISPANIC OR LATINO
## age_at_diagnosis
## 15 39
## 30 34
## 35 85
What happens if we use a value that is out of bounds, for example, for a column that doesn’t exist? brca_clin_df
has 27 columns. Let’s try to get the 30th column.
# Get the 30th column
brca_clin_df[ ,30]
Readability: We can combine functions with text to get more informative output.
Remember that the dim()
function provides two numbers:
# Assign the dimensions of the data frame to an object.
dim_df <- dim(brca_clin_df)
print(dim_df)
## [1] 1082 27
How do we get just the number of rows?
# dim_df contains two elements. We can get the first and the second
# with indexing
dim_df[1]
## [1] 1082
dim_df[2]
## [1] 27
# Since `dim_df` has two values, we can choose each separately with indexing.
# The function paste() allows us to print text and numbers together.
print(paste("The clinical data frame has",dim_df[1],
"rows and",dim_df[2],"columns."))
## [1] "The clinical data frame has 1082 rows and 27 columns."
What are all of the features in the data frame? We can collect the column names.
# Many function names in R are intuitive.
# colnames gives the names of the columns.
colnames(brca_clin_df)
## [1] "bcr_patient_barcode"
## [2] "gender"
## [3] "race"
## [4] "ethnicity"
## [5] "age_at_diagnosis"
## [6] "year_of_initial_pathologic_diagnosis"
## [7] "vital_status"
## [8] "menopause_status"
## [9] "tumor_status"
## [10] "margin_status"
## [11] "days_to_last_followup"
## [12] "prior_dx"
## [13] "new_tumor_event_after_initial_treatment"
## [14] "radiation_therapy"
## [15] "histological_type"
## [16] "pathologic_T"
## [17] "pathologic_M"
## [18] "pathologic_N"
## [19] "pathologic_stage_sub"
## [20] "pathologic_stage"
## [21] "lymph_node_examined_count"
## [22] "number_of_lymphnodes_positive"
## [23] "initial_diagnosis_method"
## [24] "surgical_procedure"
## [25] "estrogen_receptor_status"
## [26] "progesterone_receptor_status"
## [27] "her2_receptor_status"
Notice there are 27 column names for the 27 columns. By now, you should understand what some of the terms refer to. If not, loop back to Practice 2!
# We can save the feature names to a file for reference!
# `write.csv()` saves the output of our command to a csv file
# in your working directory.
write.csv(colnames(brca_clin_df),"features.csv", row.names=FALSE)
Check your directory to see if the file is there. Click on the file name to view it. Try this again by removing the argument row.names=FALSE
to see what happens.
Viewing the data frame gave us an idea of the types of information included, but how do we summarize the values for all 1,082 patients for a given feature, such as gender
, vital_status
, or estrogen_receptor_status
?
It looks like all of the patients are “FEMALE”, but is this true?
The table()
function is VERY useful. It summarizes the data for us. gender
is the second column of our data frame.
# When an index value is empty, all values are included
# So we want all rows in our summary table
# 2 refers to the second column
table(brca_clin_df[ , 2])
##
## FEMALE MALE
## 1070 12
What if we don’t know what column gender
refers to? Or we don’t feel like counting, especially if we have a large data frame? We know the name of the column, and we can use this information directly.
# The $ operator can be used to select a column with its name
# We can get the column for gender information with `$gender`.
table(brca_clin_df$gender)
##
## FEMALE MALE
## 1070 12
There are a dozen male patients in this cohort.
The function hist()
creates a histogram for numerical data. We want to use this for the features that are continuous rather than categorical.
A histogram is a diagram with rectangles whose area is proportional to the frequency of a feature and whose width is equal to the feature interval.
Let’s see what values are in the feature age_at_diagnosis
.
# The $ operator can be used to select a column with its name.
hist(brca_clin_df$age_at_diagnosis)
For example, there are almost 150 patients between the ages 45 and 50.
As with many measurements in nature, the variable age_at_diagnosis
follows a normal distribution (i.e. it is bell curve). Here are some examples.
We should label our axes and add a title that make our plot easier to interpret!
# The $ operator can be used to select a column with its name.
# `xlab` and `main` are arguments to `hist()`.
hist(brca_clin_df$age_at_diagnosis,
xlab="Age at Diagnosis", #label the x-axis
main="Distribution of Ages") #Add a title
Remember, we can always use the help()
function to learn more about what a function does and what arguments it takes.
# Get the help page for the function `hist()`
help(hist)
The Central Limit Theorem is a fundamental concept in statistics: For a variable that is influenced by many small, independent factors, the overall distribution of that variable will tend towards a normal distribution.
Many (not all!) natural phenomena can be seen as the result of numerous, small, and relatively independent contributing factors. For example, human height is influenced by genetics, nutrition, environment, and other factors. Each of these factors contributes a small amount to the overall height, and because they are largely independent, the resulting distribution tends to be approximately normal.
Often natural data are skewed where the values cluster more on one side of the distribution, creating a “tail” on either the left or the right.
# We'll use some R datasets to examine skewed data
# Set up a 1x2 layout for side-by-side plots
par(mfrow = c(1, 2))
# Annual measurements of the level, in feet, of Lake Huron 1875–1972.
hist(LakeHuron,
xlab = "Lake height",
main="Annual level of Lake Huron")
# Measurements of the annual flow of the river Nile at Aswan, 1871–1970
hist(Nile,
xlab = "Volume of flow",
main = "Annual flow of the river Nile")
# Reset the layout for plots
par(mfrow = c(1, 1))
Breast cancer cells are tested to see if they have estrogen receptors. When the hormone estrogen attaches to its receptors, it stimulates the cancer to grow. Cancers are called estrogen receptor-positive (ER+) or estrogen receptor-negative (ER-) based on whether or not they have these receptors.
Knowing the estrogen receptor status is important in deciding among treatment options.
Many breast cancer tumors are estrogen receptor (ER) positive. How many ER+ and ER- tumors are in the TCGA breast cancer cohort?
In the previous section, we saw there is a feature called estrogen_receptor_status
.
# The $ operator can be used to select a column with its name.
table(brca_clin_df$estrogen_receptor_status)
##
## [Not Evaluated] Indeterminate Negative Positive
## 48 2 236 796
Some samples do not have information on receptor status, either “[Not Evaluated]” or “Indeterminate”.
It is usually the case with real-life data that not all records (here, patients) have values for all features.
So when working with complex data sets like this, we have to be aware that there may be missing values.
Like estrogen, progesterone is a hormone that stimulates breast cancer cells to grow.
Let’s look at the status of the progesterone receptor negative (PR). There’s a data frame column called progesterone_receptor_status
.
# The $ operator can be used to select a column with its name.
table(brca_clin_df$progesterone_receptor_status)
##
## [Not Evaluated] Indeterminate Negative Positive
## 49 4 340 689
According to these summaries, there are 796 ER+ samples and 689 PR+ samples.
How many samples have positive status for both estrogen and progesterone receptors? In other words, how many samples have both ER+ and PR+?
We want to find the patients (rows in brca_clin_df
) that have
brca_clin_df$estrogen_receptor_status == “Positive”
AND
brca_clin_df$progesterone_receptor_status == “Positive”.
# The operator == finds instances that contain the value "Positive"
# We use the & operator to apply both conditions.
# Save the results to an object called `ERpos_PR_pos`
ERpos_PRpos <- brca_clin_df[(brca_clin_df$estrogen_receptor_status == "Positive"
&
brca_clin_df$progesterone_receptor_status == "Positive"), ]
# How many patients (rows) have both ER+ and PR+?
nrow(ERpos_PRpos)
## [1] 672
The overlap between the 796 ER+ tumors and 689 PR+ tumors is 672.
Many breast cancer tumors are both ER+ and PR+.
We can make this output more readable for our report. Remember, we already saved this data frame result to the object ERpos_PRpos
.
# The function paste() allows us to print text and numbers together.
print(paste("The number of samples with both ER+ and PR+ is", nrow(ERpos_PRpos)))
## [1] "The number of samples with both ER+ and PR+ is 672"
HER2 is the gene name for the Receptor tyrosine-protein kinase erbB-2; it helps control cell growth.
There is a data feature called “her2_receptor_status.”
(Remember, you can see this by using the colnames()
function as we did above, or you can look at the brca_clin_df table we created with the View()
function.)
# The $ operator can be used to select a column with its name.
table(brca_clin_df$her2_receptor_status)
##
## [Not Available] [Not Evaluated] Equivocal Indeterminate Negative
## 8 170 177 12 554
## Positive
## 161
In addition to the missing values we found for ER and PR ([Not Evaluated] and Indeterminate), HER2 status can also be [Not Available] and “Equivocal.”
Triple negative breast cancer samples have “Negative” values for
Triple negative breast cancer is agressive and difficult to treat.
How many samples have “Negative” status in all three cases?
The pieces of information we need are in the columns called
Let’s make a data frame tnbc
that contains information about the patients with triple negative breast cancer.
# The operator == finds instances that contain the value "Positive"
# We use the & operator to apply both conditions.
# Save the results to an object called `ERpos_PR_pos`
tnbc <- brca_clin_df[(brca_clin_df$estrogen_receptor_status == "Negative"
&
brca_clin_df$progesterone_receptor_status == "Negative"
&
brca_clin_df$her2_receptor_status == "Negative"),]
# How many patients (rows) have both ER+ and PR+?
nrow(tnbc)
## [1] 114
The data frame tnbc
contains a subset of the full data frame brca_clin_df
. We can examine the data and perform summary analyses with the table()
and hist()
functions.
# Examine features for patients with triple negative breast cancer
View(tnbc)
# What is the gender distribution for triple negative breast cancer
table(tnbc$gender)
##
## FEMALE
## 114
So overall, 114 samples out of 1082 total samples correspond to Triple Negative Breast Cancer, the most aggressive type of primary breast cancer, and all of the samples are from FEMALE
patients.
Create a table for “histological_type”.
table(brca_clin_df$histological_type)
##
## [Not Available] Infiltrating Carcinoma NOS
## 1 1
## Infiltrating Ductal Carcinoma Infiltrating Lobular Carcinoma
## 774 201
## Medullary Carcinoma Metaplastic Carcinoma
## 6 8
## Mixed Histology (please specify) Mucinous Carcinoma
## 29 17
## Other, specify
## 45
Most of the breast cancer samples are either “Infiltrating Ductal Carcinoma” (774 samples) or “Infiltrating Lobular Carcinoma” (201 samples), the two most common types of breast cancer. Infilatrating corresponds to invasive. Refer back to the figure “Microscopy Images of Ducts.”
We can look for a specific instance of an histological type by indexing on type name, find its frequency, and the percentage of cases it represents.
# Save the number IDC samples to a new object
N_ductal <- table(brca_clin_df$histological_type)["Infiltrating Ductal Carcinoma"]
percent_ductal <- round( N_ductal/1082 * 100 , 0)
print(paste(percent_ductal,"% of breast cancer patients have Infiltrating Ductal Carcinoma", sep=""))
## [1] "72% of breast cancer patients have Infiltrating Ductal Carcinoma"
Put your skills together to find out how many patients have stage II cancer.
Stage II describes invasive cancer that is contained in the breast and/or the growth has extended to nearby lymph nodes.
Read more about breast cancer staging here.
# First find out what the values are for `pathologic_stage`
# Uncomment the code lines below, and replace fffff with a function name
table(brca_clin_df$pathologic_stage)
##
## [Not vailable] I II III IV
## 5 179 615 250 19
## X
## 14
X denotes indeterminate stage.
# How many patients have stage II cancer?
# Uncomment the code lines below, and replace xxxxxx with the correct feature name
# and YY with the stage name
N_stageII <- table(brca_clin_df$pathologic_stage)["II"]
print(N_stageII)
## II
## 615
How would you report your result? Uncomment the print command and use the paste()
function, and a combination of text and variables to describe what you found.
print(paste(N_stageII, "patients have Stage II cancer."))
## [1] "615 patients have Stage II cancer."
What percentage of patients have stage II breast cancer?
# Find code above to calculate the percentage of stage II samples
# from N_stageII and the total number of patients.
# Then print your results.
# Don't forget to uncomment your code lines!
percent_stageII <- round( N_stageII/1082 * 100 , 0)
print(paste(percent_stageII,"% of patients have Stage II cancer.", sep=""))
## [1] "57% of patients have Stage II cancer."
Great work!!
We have learned a lot about breast cancer by examining data systematically collected from patients. It is great that these patients gave their consent to contribute to TCGA. The data is anonymized so that it cannot be traced back to any particular person. TCGA data is publicly available and used by researchers worldwide to improve our ability to diagnose, treat, and prevent cancer.
You can learn more about TCGA here.
Save your file. Find the knit
button at the top of this tab. From the dropdown menu, choose Knit to html
. This will create a file you can view in a web browser. If an error is reported during knitting, it will direct you to the line that needs to be fixed. When you see your final report, you will appreciate why we are using R Markdown in our environment to create comments, include figures, and write and run code.
In the next few sessions, we will be introduced to the gene expression data associated with the patients whose clinical data we have been examining.