DREAM-High: Finding Patterns with Heatmaps

Before you begin

Under the File tab, choose Save As… and save your own copy of this file.

Then replace "Your name here" at the top of the file with your name.

This document is your analysis notebook. It contains both:

the code you ran, and
the explanation of what you discovered.

That is one reason scientists use notebooks like R Markdown: they make the work reproducible. Someone else can read your report, see your code, and understand how you made each figure.

Big idea: data can hide patterns

Large biological datasets are often too big to understand by reading numbers in a table.

In DREAM-High, we will eventually use heatmaps to look for patterns in breast cancer gene expression data from patients in The Cancer Genome Atlas. A heatmap can help us ask questions such as:

Which samples look similar to each other?
Which genes behave similarly across patients?
Can visual patterns help us discover tumor subtypes?

Today we will learn the same basic idea using a small practice dataset that comes with R.

Main idea: A heatmap turns numbers into colors so that hidden structure becomes easier to see.

What is R?

R is a programming language used for data analysis, statistics, and visualization. In this course, we will use R to explore cancer-related datasets.

You do not need to memorize every command. Your goal is to learn how to:

run code,
inspect data,
make plots,
ask questions,
and interpret patterns.

What is RStudio?

RStudio is the environment we are using to write and run R code.

You are currently working in an R Markdown file. This type of file lets us combine:

text,
code,
results,
plots,
and interpretation

in one report.

When you click Knit, RStudio runs the code and turns this notebook into an HTML report.

A few R Markdown basics

R Markdown uses simple formatting rules:

# creates a large heading
## creates a smaller heading
**bold** makes text bold
*italic* makes text italic
backticks mark inline code, like `x <- 5`
code chunks contain R code

A code chunk looks like this:


``` r
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

You can run one code chunk at a time by clicking the green triangle at the top right of the chunk.

The practice dataset: `mtcars`

R comes with several built-in datasets. Today we will use mtcars, a small dataset about 32 cars from the 1970s.

Each row is a car.
Each column is a feature of the car.

For example:

mpg = miles per gallon
wt = weight (in thousands of pounds)
hp = horsepower
cyl = number of cylinders

This dataset is not about cancer, but it is useful because it is small, clean, and already available in R.

Later, the same heatmap logic will help us analyze cancer gene expression data.

# This is a comment line. Commenting your code is a critical practice!

# Look at the first few rows of the dataset
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Quick reflection

Before moving on, answer in one or two sentences:

What do you notice first when you look at this table?

Write your answer here:

Learning to inspect data

Before making a plot, scientists usually inspect the data.

# How many rows and columns are in mtcars?
dim(mtcars)

## [1] 32 11

# What are the column names?
colnames(mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

# What kind of object is mtcars?
class(mtcars)

## [1] "data.frame"

The result of dim(mtcars) lists the number of rows first, then the number of columns.
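
As a small sketch of the same inspection step, nrow() and ncol() return the two numbers from dim() separately, and str() gives a compact summary of every column. The variable names n_cars and n_features are made up for this illustration.

```r
# Number of rows (cars) and columns (features), stored separately
n_cars <- nrow(mtcars)
n_features <- ncol(mtcars)

# dim() returns the same two numbers as one vector: rows first, then columns
dim(mtcars)

# str() shows the type and the first few values of each column
str(mtcars)
```

Storing the counts in named objects makes them easy to reuse later in the notebook.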

Check yourself

How many cars are in the dataset?
How many features are measured for each car?

Write your answer here:

Getting help in R

R has built-in help pages.

You can run the following lines if you want more information. They are marked with eval=FALSE, which means they will not run automatically when you knit the report.

# what exactly is in mtcars?
help(mtcars)

# how is the heatmap function used and what parameters does it take?
help(heatmap)
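
If a full help page is more than you need, a quicker look at a function's arguments is sometimes enough. The sketch below uses args() and formals(), both part of base R; the object name heatmap_args is made up for this example.

```r
# ?heatmap is shorthand for help(heatmap); it is commented out here
# so that no help page opens when the report is knit
# ?heatmap

# args() prints a function's arguments and their default values
args(heatmap)

# formals() returns the same arguments as a list, so names() lists them
heatmap_args <- names(formals(heatmap))
heatmap_args
```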

From data frame to matrix

The heatmap() function expects a numeric matrix.

mtcars is a data frame. A data frame is like a table. Because all columns in mtcars are numeric, we can convert it to a matrix.

# Convert mtcars from a data frame to a numeric matrix
data <- as.matrix(mtcars)

# Check the object type
class(data)

## [1] "matrix" "array"
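
The conversion works cleanly only because every column is numeric. A minimal sketch of how one might check that first, and of what happens when a table contains text. The small table called toy is hypothetical and made up for this example.

```r
# TRUE only if every column of mtcars holds numbers
all_numeric <- all(sapply(mtcars, is.numeric))
all_numeric

# Hypothetical toy table with one text column, made up for this example
toy <- data.frame(name = c("a", "b"), value = c(1, 2))

# With a text column present, as.matrix() turns every entry into text
toy_matrix <- as.matrix(toy)
class(toy_matrix[1, "value"])
```

A matrix can hold only one type of value, which is why a single text column changes the whole result. heatmap() would not accept toy_matrix, because it is no longer numeric.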

The symbol <- is the assignment operator. It stores the value from the right side in the object named on the left side.

For example:

# Assign the value 5 to the variable x
x <- 5
x

## [1] 5

Here, R stores the number 5 in an object called x.

First heatmap: raw data

Now we will make a heatmap from the raw mtcars matrix.

# heatmap provides a color image of our data with dendrograms that help identify patterns
heatmap(data)

The heatmap turns numbers into colors.

It also clusters:

rows: cars that look similar across features
columns: features that vary in similar ways

The tree diagrams on the sides are called dendrograms. They show which rows or columns are grouped together.
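
By default, heatmap() builds its dendrograms with dist() (the distance between every pair of rows) followed by hclust() (hierarchical clustering). The sketch below runs that step on its own for the rows only. The names row_distances and row_tree are made up for this illustration.

```r
data <- as.matrix(mtcars)

# Distance between every pair of cars (Euclidean distance by default)
row_distances <- dist(data)

# Hierarchical clustering joins the closest cars first
row_tree <- hclust(row_distances)

# Draw the tree by itself, with smaller labels so the car names fit
plot(row_tree, cex = 0.6)
```

The groupings should match the row dendrogram in the heatmap, although heatmap() may draw the branches in a different left-to-right order.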

Interpretation question

Look at the heatmap above.

Is it easy to see meaningful patterns? Why or why not?

Write your answer here:

Why scaling matters

Some columns in mtcars are measured on very different scales.

For example, compare cyl and disp:

# Range of cylinders
min(mtcars$cyl)

## [1] 4

max(mtcars$cyl)

## [1] 8

# Range of displacement
min(mtcars$disp)

## [1] 71.1

max(mtcars$disp)

## [1] 472

These two features are not measured on the same scale. A value of 8 is high for cyl, but tiny for disp.

If we make a heatmap without adjusting for scale, columns with large values, such as disp and hp, can dominate both the clustering and the colors.

This same issue appears in biological data. Some measurements naturally have much larger ranges than others. If we do not account for scale, the strongest-looking pattern may simply reflect measurement units.
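
Rather than checking columns one at a time, the comparison can be made for every feature at once. This is a sketch using apply(), which runs a function over each column of a matrix; the names col_ranges and col_spread are made up for this example.

```r
data <- as.matrix(mtcars)

# Smallest and largest value in each column
col_ranges <- apply(data, 2, range)
rownames(col_ranges) <- c("min", "max")
col_ranges

# Spread of each column (max minus min), largest first
col_spread <- sort(col_ranges["max", ] - col_ranges["min", ], decreasing = TRUE)
col_spread
```

The columns at the top of col_spread are the ones most likely to dominate an unscaled heatmap.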

Prediction before scaling

Before running the next code chunk, make a prediction.

What do you think will happen after we scale each feature so the columns are more comparable?

Write your prediction here:

Scaling the data

The function scale() standardizes each column.

After scaling:

each column has an average close to 0
each column has a standard deviation of 1
features become more comparable

# Let's change the range of each feature so they are comparable
# We'll assign the output to a new variable data_scaled

data_scaled <- scale(data)
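
The properties listed above can be checked directly. The sketch below also repeats the calculation by hand for one column, assuming the default behavior of scale(): subtract the column mean, then divide by the column standard deviation. The names col_means, col_sds, and manual_mpg are made up for this illustration.

```r
data <- as.matrix(mtcars)
data_scaled <- scale(data)

# Mean of each scaled column: zero, apart from tiny rounding error
col_means <- colMeans(data_scaled)
round(col_means, 10)

# Standard deviation of each scaled column: 1
col_sds <- apply(data_scaled, 2, sd)
col_sds

# The same calculation by hand for the mpg column
manual_mpg <- (data[, "mpg"] - mean(data[, "mpg"])) / sd(data[, "mpg"])

# TRUE if the hand calculation matches what scale() produced
all.equal(manual_mpg, data_scaled[, "mpg"])
```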

Let’s compare the first few values before and after scaling.

# First five rows and first five columns before scaling
data[1:5, 1:5]

##                    mpg cyl disp  hp drat
## Mazda RX4         21.0   6  160 110 3.90
## Mazda RX4 Wag     21.0   6  160 110 3.90
## Datsun 710        22.8   4  108  93 3.85
## Hornet 4 Drive    21.4   6  258 110 3.08
## Hornet Sportabout 18.7   8  360 175 3.15

# First five rows and first five columns after scaling
data_scaled[1:5, 1:5]

##                          mpg        cyl       disp         hp       drat
## Mazda RX4          0.1508848 -0.1049878 -0.5706198 -0.5350928  0.5675137
## Mazda RX4 Wag      0.1508848 -0.1049878 -0.5706198 -0.5350928  0.5675137
## Datsun 710         0.4495434 -1.2248578 -0.9901821 -0.7830405  0.4739996
## Hornet 4 Drive     0.2172534 -0.1049878  0.2200937 -0.5350928 -0.9661175
## Hornet Sportabout -0.2307345  1.0148821  1.0430812  0.4129422 -0.8351978

The syntax data[1:5, 1:5] means:

rows 1 through 5
columns 1 through 5

This is called indexing.
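
Indexing also works with names, and leaving one side of the comma empty keeps everything on that side. A short sketch, with the object names small_block and all_mpg made up for this example:

```r
data <- as.matrix(mtcars)

# One value: row and column picked by name
data["Mazda RX4", "mpg"]

# First three rows by position, two columns by name
small_block <- data[1:3, c("mpg", "wt")]
small_block

# Leaving the row position empty keeps every row
all_mpg <- data[, "mpg"]
length(all_mpg)
```

Using names instead of positions makes code easier to read and keeps it correct if the column order changes.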

Second heatmap: scaled data

Now we make a heatmap using the scaled data.

# heatmap provides a color image of our data with dendrograms that help identify patterns
heatmap(data_scaled)
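
One detail worth knowing: heatmap() has its own scale argument, and its default is "row" (unless symm = TRUE). That means the function rescales each row again before choosing colors, on top of the column scaling we did with scale(). As far as the function's code shows, this affects only the colors, not the clustering. The sketch below draws both versions so they can be compared; the names hm_default and hm_none are made up for this example.

```r
data_scaled <- scale(as.matrix(mtcars))

# Default: heatmap() rescales each row again before choosing colors
hm_default <- heatmap(data_scaled)

# scale = "none": colors show the column-scaled values as they are
hm_none <- heatmap(data_scaled, scale = "none")
```

Which version is easier to read depends on the question. With scale = "none", a color means the same thing in every row, which can make features easier to compare across cars.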

Interpretation questions

Answer the following:

Does this heatmap look more informative than the raw-data heatmap?
Which features seem to group together?
Do mpg and wt show similar or opposite patterns?
Do the car groupings make sense based on what you know about cars?

Write your answers here:

Color choices

Different color palettes can make patterns easier or harder to see.

We will use the package RColorBrewer, which contains ready-made color palettes.

# Packages are loaded with the library() function
library(RColorBrewer)

# Shrink the text size so the palette names fit in the plot
par(cex = 0.5)

# Show available palettes
display.brewer.all()

Now choose one palette and use it in a heatmap.

# Try changing "Greens" to another palette name
heatmap(data_scaled, col = brewer.pal(8, "Greens"))
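
RColorBrewer groups its palettes into sequential, diverging, and qualitative types. Scaled data has values both above and below 0, so a diverging palette is one reasonable choice. The sketch below assumes RColorBrewer is installed and uses colorRampPalette() from base R to blend the palette into more color steps; the names base_colors and smooth_colors are made up for this example.

```r
library(RColorBrewer)

data_scaled <- scale(as.matrix(mtcars))

# "RdBu" is a diverging palette; rev() puts blue at the low end and red at the high end
base_colors <- rev(brewer.pal(11, "RdBu"))

# colorRampPalette() blends the 11 colors into a smoother scale of 51 colors
smooth_colors <- colorRampPalette(base_colors)(51)

heatmap(data_scaled, col = smooth_colors)
```

More color steps make gradual differences easier to see, while fewer steps make broad groups stand out.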

Customization challenge

Choose a different color palette and make a new heatmap.

# Replace "Purples" with a palette you choose
heatmap(data_scaled, col = brewer.pal(8, "Purples"))

Which palette did you choose, and why?

A more focused question: relationships among features

Sometimes we care less about individual rows and more about how variables relate to one another.

A correlation matrix measures how strongly pairs of features move together.

A positive correlation means two features tend to increase together.
A negative correlation means one feature tends to increase while the other decreases.
A correlation near zero means there is little or no linear relationship.
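
To see where a correlation value comes from, here is a sketch that computes one by hand for mpg and wt and compares it with cor(). It assumes the Pearson correlation, which is the default method of cor(). The names x_dev, y_dev, and manual_r are made up for this illustration.

```r
x <- mtcars$mpg
y <- mtcars$wt

# How far each car is from the average of each feature
x_dev <- x - mean(x)
y_dev <- y - mean(y)

# Pearson correlation: shared variation divided by total variation
manual_r <- sum(x_dev * y_dev) / sqrt(sum(x_dev^2) * sum(y_dev^2))
manual_r

# cor() uses the Pearson method by default and should agree
cor(x, y)
```

When cars with above-average weight tend to have below-average mpg, the products in the numerator are mostly negative, which gives a negative correlation.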

# create an object to hold the correlation values
cor_mtcars <- cor(mtcars)

# Round the values to two decimal places
round(cor_mtcars, 2)

##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00

We can visualize this correlation matrix as a heatmap too.

# heatmap works with any numeric matrix
heatmap(cor_mtcars, symm = TRUE)
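
The heatmap gives a visual overview, but exact values are easier to pull out with code. The sketch below is one way to locate the largest and smallest entries of the correlation matrix. The diagonal is set to NA first, because every feature has a correlation of 1 with itself. The names off_diag, strongest_positive, and strongest_negative are made up for this example.

```r
cor_mtcars <- cor(mtcars)

# Copy the matrix and blank out the diagonal (each feature vs. itself)
off_diag <- cor_mtcars
diag(off_diag) <- NA

# Position (row, column) of the largest remaining value
pos <- arrayInd(which.max(off_diag), dim(off_diag))
strongest_positive <- c(rownames(off_diag)[pos[1]], colnames(off_diag)[pos[2]])
strongest_positive

# Position (row, column) of the smallest remaining value
neg <- arrayInd(which.min(off_diag), dim(off_diag))
strongest_negative <- c(rownames(off_diag)[neg[1]], colnames(off_diag)[neg[2]])
strongest_negative
```

Try finding the pairs by eye first, then use the code to check.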

Interpretation question

Find one pair of features with a strong positive relationship and one pair with a strong negative relationship.

Write your answer here:

Why this matters for cancer biology

Today we used cars because the dataset is small and easy to practice with.

But the same workflow can be used for biological data:

patients or samples  → rows
genes or features    → columns
expression values    → numbers inside the matrix
heatmap              → visual pattern discovery
clustering           → grouping similar samples or genes
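
The mapping above can be sketched with a tiny simulated matrix. Everything in this example is made up for illustration: the numbers are random, the sample and gene names are hypothetical, and the group difference is added artificially. It is not real patient data.

```r
# Simulated data only: 12 hypothetical samples (rows) and 8 hypothetical genes (columns)
set.seed(1)
expr <- matrix(
  rnorm(12 * 8),
  nrow = 12,
  dimnames = list(paste0("sample_", 1:12), paste0("gene_", 1:8))
)

# Artificially raise the first four genes in the first six samples
# so that there is a group for the heatmap to find
expr[1:6, 1:4] <- expr[1:6, 1:4] + 3

# Same workflow as with mtcars: scale the columns, then draw the heatmap
expr_scaled <- scale(expr)
heatmap(expr_scaled)
```

The steps are the same ones used for the cars: build a numeric matrix, scale it, and let the heatmap cluster rows and columns.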

In breast cancer gene expression data, heatmaps can help reveal groups of patients whose tumors have similar molecular patterns. These patterns can sometimes correspond to clinically meaningful tumor subtypes.

This is one example of how coding helps us see structure in complex biological systems.

Final reflection

Answer these questions in a few sentences.

What did scaling change about the heatmap?
Why might heatmaps be useful for studying cancer data?
What is one R concept you feel more comfortable with now?
What is one R concept you still find confusing?

Write your reflection here:

Knit your report

Click the Knit button and choose Knit to HTML.

Your final report should include:

your name,
your code,
your heatmaps,
your answers to the reflection questions,
and your interpretation of the patterns you found.

Summary

In this activity, you practiced:

working in RStudio,
editing an R Markdown notebook,
running R code chunks,
inspecting a dataset,
converting a data frame to a matrix,
using the assignment operator <-,
indexing rows and columns,
making heatmaps,
scaling data,
changing color palettes,
interpreting clusters and correlations,
and connecting a simple dataset to future cancer biology analyses.

The most important idea is not the exact syntax. The most important idea is this:

Computational tools can help us discover patterns that are hidden inside large biological datasets.

DREAM-High: Finding Patterns with Heatmaps

DREAM-High Consortium

5/28/2026

