Before you begin

Under the File tab, choose Save As… and save your own copy of this file.

Then replace "Your name here" at the top of the file with your name.

This document is your analysis notebook. It contains both:

  1. the code you ran, and
  2. the explanation of what you discovered.

That is one reason scientists use notebooks like R Markdown: they make the work reproducible. Someone else can read your report, see your code, and understand how you made each figure.


Big idea: data can hide patterns

Large biological datasets are often too big to understand by reading numbers in a table.

In DREAM-High, we will eventually use heatmaps to look for patterns in breast cancer gene expression data from patients in The Cancer Genome Atlas. A heatmap can help us ask questions such as:

Today we will learn the same basic idea using a small practice dataset that comes with R.

Main idea: A heatmap turns numbers into colors so that hidden structure becomes easier to see.


What is R?

R is a programming language used for data analysis, statistics, and visualization. In this course, we will use R to explore cancer-related datasets.

You do not need to memorize every command. Your goal is to learn how to:


What is RStudio?

RStudio is the environment we are using to write and run R code.

You are currently working in an R Markdown file. This type of file lets us combine:

in one report.

When you click Knit, RStudio runs the code and turns this notebook into an HTML report.


A few R Markdown basics

R Markdown uses simple formatting rules:

A code chunk looks like this:


``` r
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

You can run one code chunk at a time by clicking the green triangle at the top right of the chunk.


The practice dataset: mtcars

R comes with several built-in datasets. Today we will use mtcars, a small dataset about 32 cars from the 1970s.

Each row is a car.
Each column is a feature of the car.

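For example:

  • mpg: fuel efficiency in miles per gallon
  • cyl: number of cylinders
  • hp: horsepower
  • wt: weight, in thousands of pounds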

This dataset is not about cancer, but it is useful because it is small, clean, and already available in R.

Later, the same heatmap logic will help us analyze cancer gene expression data.

# This is a comment line. Commenting your code is a critical practice!

# Look at the first few rows of the dataset
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Quick reflection

Before moving on, answer in one or two sentences:

What do you notice first when you look at this table?

Write your answer here:


Learning to inspect data

Before making a plot, scientists usually inspect the data.

# How many rows and columns are in mtcars?
dim(mtcars)
## [1] 32 11
# What are the column names?
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# What kind of object is mtcars?
class(mtcars)
## [1] "data.frame"

The result of dim(mtcars) lists the number of rows first and the number of columns second.
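
As a small sketch, the two numbers from dim() can also be pulled out one at a time with nrow() and ncol(), and str() gives a compact overview of every column. These are all base R functions; the name dims is made up for this example.

```r
# Number of rows (cars) and columns (features), one at a time
nrow(mtcars)
ncol(mtcars)

# dim() returns both numbers together: rows first, then columns
dims <- dim(mtcars)
dims[1]  # rows
dims[2]  # columns

# Compact overview: the type and first few values of every column
str(mtcars)
```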

Check yourself

  • How many cars are in the dataset?
  • How many features are measured for each car?

Write your answer here:


Getting help in R

R has built-in help pages.

You can run the following lines if you want more information. They are marked with eval=FALSE, which means they will not run automatically when you knit the report.

# what exactly is in mtcars?
help(mtcars)

# how is the heatmap function used and what parameters does it take?
help(heatmap)
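
A possible shortcut, assuming you are working interactively in RStudio: typing ? before a name opens the same help page as help(). If you only want a reminder of a function's arguments, args() prints them in the console. The lines below are a sketch and not part of the original activity; the name heatmap_args is made up for this example.

```r
# ?heatmap is shorthand for help(heatmap); uncomment to open the help page
# ?heatmap

# Print the arguments heatmap() accepts, with their default values
args(heatmap)

# Store just the argument names so we can look through them
heatmap_args <- names(formals(heatmap))
heatmap_args
```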

From data frame to matrix

The heatmap() function expects a numeric matrix.

mtcars is a data frame. A data frame is like a table. Because all columns in mtcars are numeric, we can convert it to a matrix.

# Convert mtcars from a data frame to a numeric matrix
data <- as.matrix(mtcars)

# Check the object type
class(data)
## [1] "matrix" "array"
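
The claim that every column is numeric can be checked directly. The sketch below uses sapply() with is.numeric(), and then shows why the check matters using iris, another dataset built into R (it is not used elsewhere in this notebook): one non-numeric column is enough to turn the whole matrix into text.

```r
# TRUE for each column that holds numbers
sapply(mtcars, is.numeric)

# TRUE only if every column is numeric
all_numeric <- all(sapply(mtcars, is.numeric))
all_numeric

# iris has a non-numeric Species column, so as.matrix() gives a character matrix
iris_matrix <- as.matrix(iris)
is.numeric(iris_matrix)

# Keeping only the numeric columns gives a numeric matrix again
iris_numeric <- as.matrix(iris[, sapply(iris, is.numeric)])
is.numeric(iris_numeric)
```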

The symbol <- is the assignment operator. It stores the value from the right side in the object named on the left side.

For example:

# Assign the value 5 to the variable x
x <- 5
x
## [1] 5

Here, R stores the number 5 in an object called x.


First heatmap: raw data

Now we will make a heatmap from the raw mtcars matrix.

# heatmap provides a color image of our data with dendrograms that help identify patterns
heatmap(data)

The heatmap turns numbers into colors.

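It also clusters both the rows (cars) and the columns (features), reordering them so that similar ones sit next to each other.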

The tree diagrams on the sides are called dendrograms. They show which rows or columns are grouped together.
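
To see where a dendrogram comes from, here is a minimal sketch that builds the row dendrogram by hand. By default, heatmap() measures how different two rows are with dist() and groups them with hclust(). The object names car_distances, car_tree, and car_groups are made up for this example, and cutting the tree into 3 groups is an arbitrary choice.

```r
data <- as.matrix(mtcars)

# Distance between every pair of cars, using all 11 features
car_distances <- dist(data)

# Hierarchical clustering: repeatedly merge the most similar cars or groups
car_tree <- hclust(car_distances)

# Draw the tree; cars that join low on the tree are most alike
plot(car_tree, cex = 0.6)

# Cut the tree into 3 groups and count the cars in each
car_groups <- cutree(car_tree, k = 3)
table(car_groups)
```

The tree drawn by heatmap() has the same branches, although heatmap() may flip the order in which they are drawn.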

Interpretation question

Look at the heatmap above.

Is it easy to see meaningful patterns? Why or why not?

Write your answer here:


Why scaling matters

Some columns in mtcars are measured on very different scales.

For example, compare cyl and disp:

# Range of cylinders
min(mtcars$cyl)
## [1] 4
max(mtcars$cyl)
## [1] 8
# Range of displacement
min(mtcars$disp)
## [1] 71.1
max(mtcars$disp)
## [1] 472

These two features are not measured on the same scale. A value of 8 is high for cyl, but tiny for disp.

If we make a heatmap without adjusting for scale, large-number columns can dominate the visualization.

This same issue appears in biological data. Some measurements naturally have much larger ranges than others. If we do not account for scale, the strongest-looking pattern may simply reflect measurement units.
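
Rather than checking columns one at a time, the comparison can be done for all features at once. A sketch using range() and sapply(), both base R; the names feature_ranges and range_widths are made up for this example.

```r
# range() returns the minimum and maximum together
range(mtcars$cyl)
range(mtcars$disp)

# Apply range() to every column; row 1 is the minimum, row 2 the maximum
feature_ranges <- sapply(mtcars, range)
feature_ranges

# Width of each range (max minus min), sorted from largest to smallest
range_widths <- sort(feature_ranges[2, ] - feature_ranges[1, ], decreasing = TRUE)
range_widths
```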


Prediction before scaling

Before running the next code chunk, make a prediction.

What do you think will happen after we scale each feature so the columns are more comparable?

Write your prediction here:


Scaling the data

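The function scale() standardizes each column: it subtracts the column's mean and then divides by the column's standard deviation.

After scaling, every column has a mean of 0 and a standard deviation of 1, so each value tells us how far a car is above or below average for that feature.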

# Let's change the range of each feature so they are comparable
# We'll assign the output to a new variable data_scaled

data_scaled <- scale(data)
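
As a rough check of what scale() does with its default settings, the sketch below standardizes one column by hand: subtract the column mean, then divide by the column standard deviation. The name mpg_by_hand is made up for this example.

```r
data <- as.matrix(mtcars)
data_scaled <- scale(data)

# Standardize mpg by hand: (value - mean) / standard deviation
mpg_by_hand <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)

# Compare with the mpg column produced by scale()
all.equal(mpg_by_hand, as.numeric(data_scaled[, "mpg"]))

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(data_scaled), 10)
apply(data_scaled, 2, sd)
```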

Let’s compare the first few values before and after scaling.

# First five rows and first five columns before scaling
data[1:5, 1:5]
##                    mpg cyl disp  hp drat
## Mazda RX4         21.0   6  160 110 3.90
## Mazda RX4 Wag     21.0   6  160 110 3.90
## Datsun 710        22.8   4  108  93 3.85
## Hornet 4 Drive    21.4   6  258 110 3.08
## Hornet Sportabout 18.7   8  360 175 3.15
# First five rows and first five columns after scaling
data_scaled[1:5, 1:5]
##                          mpg        cyl       disp         hp       drat
## Mazda RX4          0.1508848 -0.1049878 -0.5706198 -0.5350928  0.5675137
## Mazda RX4 Wag      0.1508848 -0.1049878 -0.5706198 -0.5350928  0.5675137
## Datsun 710         0.4495434 -1.2248578 -0.9901821 -0.7830405  0.4739996
## Hornet 4 Drive     0.2172534 -0.1049878  0.2200937 -0.5350928 -0.9661175
## Hornet Sportabout -0.2307345  1.0148821  1.0430812  0.4129422 -0.8351978

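The syntax data[1:5, 1:5] means: take rows 1 through 5 and columns 1 through 5. The part before the comma selects rows, and the part after the comma selects columns.

This is called indexing.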
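
A few more indexing patterns, as a sketch. Square brackets take a row selection before the comma and a column selection after it; leaving one side empty means "all of them", and names work as well as numbers.

```r
data <- as.matrix(mtcars)

# One value: row 1, column 1
data[1, 1]

# Leave the column position empty to get a whole row
data[1, ]

# Leave the row position empty to get a whole column (first few values shown)
head(data[, "mpg"])

# Rows and columns can also be selected by name
data["Datsun 710", c("mpg", "wt")]
```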


Second heatmap: scaled data

Now we make a heatmap using the scaled data.

# heatmap provides a color image of our data with dendrograms that help identify patterns
heatmap(data_scaled)
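
One detail worth knowing: heatmap() has its own scale argument, and for a non-symmetric matrix its default is "row", so the colors are rescaled within each row even when the data were already standardized by column. The sketch below is an optional comparison, not part of the original activity. It turns that extra step off with scale = "none" so the colors show the values in data_scaled directly. The clustering is computed before the colors are rescaled, so the row order should be the same in both plots. The names hm_default and hm_none are made up for this example.

```r
data_scaled <- scale(as.matrix(mtcars))

# Default: colors are rescaled within each row (scale = "row")
hm_default <- heatmap(data_scaled)

# Colors show the column-standardized values as they are
hm_none <- heatmap(data_scaled, scale = "none")

# heatmap() invisibly returns the row and column order it used
identical(hm_default$rowInd, hm_none$rowInd)
```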

Interpretation questions

Answer the following:

  1. Does this heatmap look more informative than the raw-data heatmap?
  2. Which features seem to group together?
  3. Do mpg and wt show similar or opposite patterns?
  4. Do the car groupings make sense based on what you know about cars?

Write your answers here:


Color choices

Different color palettes can make patterns easier or harder to see.

We will use the package RColorBrewer, which contains ready-made color palettes.

# Packages are loaded with the library() function
library(RColorBrewer)
# Parameters for plotting, so we can see the colors' names
par(cex = 0.5)

# Show available palettes
display.brewer.all()

Now choose one palette and use it in a heatmap.

# Try changing "Greens" to another palette name
heatmap(data_scaled, col = brewer.pal(8, "Greens"))
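
brewer.pal() returns only a handful of colors (8 in the example above). If you want smoother shading, one option is colorRampPalette() from base R, which stretches a short palette into as many steps as you ask for. The sketch below also tries "RdBu", a diverging RColorBrewer palette; that choice is a suggestion, not part of the original activity. Diverging palettes can suit scaled data because the values sit on both sides of zero. The names greens_50 and red_blue_50 are made up for this example.

```r
library(RColorBrewer)

data_scaled <- scale(as.matrix(mtcars))

# Stretch the 9-color "Greens" palette into 50 smooth steps
greens_50 <- colorRampPalette(brewer.pal(9, "Greens"))(50)
heatmap(data_scaled, col = greens_50)

# A diverging palette: one color family for low values, another for high
red_blue_50 <- colorRampPalette(brewer.pal(11, "RdBu"))(50)
heatmap(data_scaled, col = red_blue_50)
```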

Customization challenge

Choose a different color palette and make a new heatmap.

# Replace "Purples" with a palette you choose
heatmap(data_scaled, col = brewer.pal(8, "Purples"))

Which palette did you choose, and why?


A more focused question: relationships among features

Sometimes we care less about individual rows and more about how variables relate to one another.

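A correlation matrix holds the correlation between every pair of features. Each value runs from -1 (the two features move in opposite directions) through 0 (no linear relationship) to 1 (the two features move together).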

# create an object to hold the correlation values
cor_mtcars <- cor(mtcars)

# Round the values to two decimal places
round(cor_mtcars, 2)
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
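
Each entry in the table comes from comparing two columns. Here is a sketch of two ways to get a single entry, using mpg against wt: calling cor() on the two columns, or multiplying their standardized values and dividing the sum by n - 1. The second version is the Pearson correlation, which cor() computes by default, written out by hand. The names cor_direct, cor_by_hand, z_mpg, and z_wt are made up for this example.

```r
# Correlation between two columns
cor_direct <- cor(mtcars$mpg, mtcars$wt)
round(cor_direct, 2)

# The same number from standardized values (z-scores)
z_mpg <- as.numeric(scale(mtcars$mpg))
z_wt  <- as.numeric(scale(mtcars$wt))
n <- nrow(mtcars)
cor_by_hand <- sum(z_mpg * z_wt) / (n - 1)

all.equal(cor_direct, cor_by_hand)

# Look up the same pair in the full matrix by name
cor(mtcars)["mpg", "wt"]
```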

We can visualize this correlation matrix as a heatmap too.

# heatmap works with any numeric matrix; symm = TRUE tells it the matrix is
# symmetric, so rows and columns are put in the same order
heatmap(cor_mtcars, symm = TRUE)

Interpretation question

Find one pair of features with a strong positive relationship and one pair with a strong negative relationship.

Write your answer here:


Why this matters for cancer biology

Today we used cars because the dataset is small and easy to practice with.

But the same workflow can be used for biological data:

patients or samples  → rows
genes or features    → columns
expression values    → numbers inside the matrix
heatmap              → visual pattern discovery
clustering           → grouping similar samples or genes
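
To make the mapping concrete, here is a sketch on made-up numbers. Everything in it is simulated with rnorm(): the sample names, gene names, group sizes, and the size of the shift are all invented for illustration and do not come from any real patient data. Six samples form group A, and six samples form group B, where the first four genes are shifted upward.

```r
set.seed(1)  # make the random numbers repeatable

n_per_group <- 6
n_genes <- 8

# Simulated expression values: rows are samples, columns are genes
group_a <- matrix(rnorm(n_per_group * n_genes), nrow = n_per_group)
group_b <- matrix(rnorm(n_per_group * n_genes), nrow = n_per_group)

# In group B, raise genes 1 to 4 (an invented "subtype" signal)
group_b[, 1:4] <- group_b[, 1:4] + 4

expr <- rbind(group_a, group_b)
rownames(expr) <- paste0("sample_", 1:(2 * n_per_group))
colnames(expr) <- paste0("gene_", 1:n_genes)

# Same workflow as with mtcars: scale the columns, then draw the heatmap
expr_scaled <- scale(expr)
heatmap(expr_scaled)

# Cut the sample tree into 2 clusters and compare with the simulated groups
clusters <- cutree(hclust(dist(expr_scaled)), k = 2)
true_group <- rep(c("A", "B"), each = n_per_group)
table(clusters, true_group)
```

If the clustering has picked up the simulated signal, most samples from group A should land in one cluster and most from group B in the other. Real expression data are far noisier, so the groups are rarely this clean.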

In breast cancer gene expression data, heatmaps can help reveal groups of patients whose tumors have similar molecular patterns. These patterns can sometimes correspond to clinically meaningful tumor subtypes.

This is one example of how coding helps us see structure in complex biological systems.


Final reflection

Answer these questions in a few sentences.

  1. What did scaling change about the heatmap?
  2. Why might heatmaps be useful for studying cancer data?
  3. What is one R concept you feel more comfortable with now?
  4. What is one R concept you still find confusing?

Write your reflection here:


Knit your report

Click the Knit button and choose Knit to HTML.

Your final report should include:


Summary

In this activity, you practiced:

The most important idea is not the exact syntax. The most important idea is this:

Computational tools can help us discover patterns that are hidden inside large biological datasets.