Under the File tab, choose Save As… and save your own copy of this file.
Then replace "Your name here" at the top of the file with
your name.
This document is your analysis notebook. It contains both written explanations and R code.
That is one reason scientists use notebooks like R Markdown: they make the work reproducible. Someone else can read your report, see your code, and understand how you made each figure.
Large biological datasets are often too big to understand by reading numbers in a table.
In DREAM-High, we will eventually use heatmaps to look for patterns in breast cancer gene expression data from patients in The Cancer Genome Atlas. A heatmap can help us ask questions such as:
Today we will learn the same basic idea using a small practice dataset that comes with R.
Main idea: A heatmap turns numbers into colors so that hidden structure becomes easier to see.
R is a programming language used for data analysis, statistics, and visualization. In this course, we will use R to explore cancer-related datasets.
You do not need to memorize every command. Your goal is to learn how to:
RStudio is the environment we are using to write and run R code.
You are currently working in an R Markdown file. This type of file lets us combine text, code, and the output of that code in one report.
When you click Knit, RStudio runs the code and turns this notebook into an HTML report.
R Markdown uses simple formatting rules:
- `#` creates a large heading
- `##` creates a smaller heading
- `**bold**` makes text bold
- `*italic*` makes text italic
- backticks around text, as in `x <- 5`, format it as inline code

A code chunk looks like this:
``` r
head(mtcars)
```
```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```
You can run one code chunk at a time by clicking the green triangle at the top right of the chunk.
R comes with several built-in datasets. Today we will use
mtcars, a small dataset about 32 cars from the 1970s.
Each row is a car.
Each column is a feature of the car.
For example:
- mpg = miles per gallon
- wt = weight
- hp = horsepower
- cyl = number of cylinders

This dataset is not about cancer, but it is useful because it is small, clean, and already available in R.
Later, the same heatmap logic will help us analyze cancer gene expression data.
``` r
# This is a comment line. Commenting your code is a critical practice!
# Look at the first few rows of the dataset
head(mtcars)
```
```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```
Before moving on, answer in one or two sentences:
What do you notice first when you look at this table?
Write your answer here:
Before making a plot, scientists usually inspect the data.
``` r
# How many rows and columns are in mtcars?
dim(mtcars)
```
```
## [1] 32 11
```
``` r
# What are the column names?
colnames(mtcars)
```
```
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
```
``` r
# What kind of object is mtcars?
class(mtcars)
```
```
## [1] "data.frame"
```
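As a small extension of the inspection step, `dim()` returns two numbers, rows first and columns second, so each one can be pulled out separately. This is a minimal sketch; the object names `size`, `n_cars`, and `n_features` are made up for illustration.

```r
# dim() returns c(number of rows, number of columns)
size <- dim(mtcars)

n_cars <- size[1]      # rows: one per car
n_features <- size[2]  # columns: one per feature

# nrow() and ncol() are shortcuts for the same two numbers
nrow(mtcars)
ncol(mtcars)

# str() gives a compact overview of every column and its type
str(mtcars)
```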
The result of dim(mtcars) tells us the number of rows
and columns.
Write your answer here:
R has built-in help pages.
You can run the following lines if you want more information. They
are marked with eval=FALSE, which means they will not run
automatically when you knit the report.
``` r
# What exactly is in mtcars?
help(mtcars)
# How is the heatmap function used and what parameters does it take?
help(heatmap)
```
The heatmap() function expects a numeric
matrix.
mtcars is a data frame. A data frame is like a table.
Because all columns in mtcars are numeric, we can convert
it to a matrix.
``` r
# Convert mtcars from a data frame to a numeric matrix
data <- as.matrix(mtcars)
# Check the object type
class(data)
```
```
## [1] "matrix" "array"
```
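The conversion works cleanly because every column in mtcars is numeric. Below is a hedged sketch of how one might check that first, and what happens when it is not true. It uses `iris`, another dataset built into R, purely as a contrast; `iris` is not part of this activity, and the object names are made up.

```r
# TRUE for each column that holds numbers
sapply(mtcars, is.numeric)

# all() collapses those answers into a single TRUE or FALSE
all(sapply(mtcars, is.numeric))

# Contrast: iris has a non-numeric Species column,
# so as.matrix() falls back to a matrix of text
m_iris <- as.matrix(iris)
is.numeric(m_iris)

# Keeping only the numeric columns avoids that problem
m_iris_num <- as.matrix(iris[, sapply(iris, is.numeric)])
is.numeric(m_iris_num)
```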
The symbol <- is the assignment operator. It stores
the value on its right side in the object named on its left
side.
For example:
``` r
# Assign the value 5 to the variable x
x <- 5
x
```
```
## [1] 5
```
Here, R stores the number 5 in an object called
x.
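Objects can also store the result of a calculation or a function call and be reused later. A short sketch, with object names (`y`, `avg_mpg`) chosen only for this example:

```r
# Store a number, then reuse it in a calculation
x <- 5
y <- x * 2

# Store the result of a function call
avg_mpg <- mean(mtcars$mpg)

# Typing an object's name prints its value
y
avg_mpg
```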
Now we will make a heatmap from the raw mtcars
matrix.
``` r
# heatmap provides a color image of our data with dendrograms that help identify patterns
heatmap(data)
```
The heatmap turns numbers into colors.
It also clusters the rows (cars) and the columns (features), so that similar ones end up next to each other.
The tree diagrams on the sides are called dendrograms. They show which rows or columns are grouped together.
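To make the clustering step less mysterious, the sketch below reproduces by hand roughly what `heatmap()` does for the rows with its default settings: it measures the distance between every pair of cars with `dist()` and then builds a tree with `hclust()`. It illustrates the idea and is not a replacement for `heatmap()`; the object names and the choice of 3 groups are arbitrary.

```r
data <- as.matrix(mtcars)

# Distance between every pair of cars (rows), using all 11 features
car_dist <- dist(data)

# Hierarchical clustering: repeatedly merge the two closest groups
car_tree <- hclust(car_dist)

# Draw the tree (dendrogram) on its own
plot(car_tree, cex = 0.6, main = "Cars clustered by raw feature values")

# Cut the tree into 3 groups and count the cars in each
car_groups <- cutree(car_tree, k = 3)
table(car_groups)
```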
Look at the heatmap above.
Is it easy to see meaningful patterns? Why or why not?
Write your answer here:
Some columns in mtcars are measured on very different
scales.
For example, compare cyl and disp:
``` r
# Range of cylinders
min(mtcars$cyl)
```
```
## [1] 4
```
``` r
max(mtcars$cyl)
```
```
## [1] 8
```
``` r
# Range of displacement
min(mtcars$disp)
```
```
## [1] 71.1
```
``` r
max(mtcars$disp)
```
```
## [1] 472
```
These two features are not measured on the same scale. A value of 8
is high for cyl, but tiny for disp.
If we make a heatmap without adjusting for scale, large-number columns can dominate the visualization.
This same issue appears in biological data. Some measurements naturally have much larger ranges than others. If we do not account for scale, the strongest-looking pattern may simply reflect measurement units.
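One way to see the scale problem for every column at once, rather than two columns at a time, is sketched below. It uses base R's `apply()` and `range()`; the object name `col_ranges` is made up for this example.

```r
data <- as.matrix(mtcars)

# apply(..., 2, f) runs f on every column (1 would mean every row)
col_ranges <- apply(data, 2, range)
rownames(col_ranges) <- c("min", "max")
col_ranges

# Spread of each column: max minus min
col_ranges["max", ] - col_ranges["min", ]
```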
Before running the next code chunk, make a prediction.
What do you think will happen after we scale each feature so the columns are more comparable?
Write your prediction here:
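The function scale() standardizes each column.
After scaling, each column has a mean of 0 and a standard deviation of 1, so a value tells us how far a car is above or below the average for that feature.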
``` r
# Let's change the range of each feature so they are comparable
# We'll assign the output to a new variable data_scaled
data_scaled <- scale(data)
```
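With its default settings, `scale()` subtracts each column's mean and divides by that column's standard deviation. The sketch below checks this by hand for one column and then confirms the result for all columns; `mpg_by_hand` is a made-up object name.

```r
data <- as.matrix(mtcars)
data_scaled <- scale(data)

# Standardize mpg by hand: (value - mean) / standard deviation
mpg_by_hand <- (data[, "mpg"] - mean(data[, "mpg"])) / sd(data[, "mpg"])

# Should match the mpg column produced by scale()
all.equal(mpg_by_hand, data_scaled[, "mpg"])

# After scaling, every column has mean 0 (up to rounding) and sd 1
round(colMeans(data_scaled), 10)
apply(data_scaled, 2, sd)
```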
Let’s compare the first few values before and after scaling.
``` r
# First five rows and first five columns before scaling
data[1:5, 1:5]
```
```
##                    mpg cyl disp  hp drat
## Mazda RX4         21.0   6  160 110 3.90
## Mazda RX4 Wag     21.0   6  160 110 3.90
## Datsun 710        22.8   4  108  93 3.85
## Hornet 4 Drive    21.4   6  258 110 3.08
## Hornet Sportabout 18.7   8  360 175 3.15
```
``` r
# First five rows and first five columns after scaling
data_scaled[1:5, 1:5]
```
```
##                          mpg        cyl       disp         hp       drat
## Mazda RX4          0.1508848 -0.1049878 -0.5706198 -0.5350928  0.5675137
## Mazda RX4 Wag      0.1508848 -0.1049878 -0.5706198 -0.5350928  0.5675137
## Datsun 710         0.4495434 -1.2248578 -0.9901821 -0.7830405  0.4739996
## Hornet 4 Drive     0.2172534 -0.1049878  0.2200937 -0.5350928 -0.9661175
## Hornet Sportabout -0.2307345  1.0148821  1.0430812  0.4129422 -0.8351978
```
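The syntax data[1:5, 1:5] means:

- the part before the comma, 1:5, selects rows 1 through 5
- the part after the comma, 1:5, selects columns 1 through 5

This is called indexing.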
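A few more indexing patterns, as a sketch. They rely only on base R behavior: leaving one side of the comma empty keeps everything on that side, and names can be used instead of positions.

```r
data <- as.matrix(mtcars)

# Row 1, all columns (nothing after the comma means "every column")
data[1, ]

# All rows, only the mpg column
data[, "mpg"]

# One car, two features, selected by name
data["Valiant", c("mpg", "wt")]

# Rows 1 to 3, columns 1 and 6
data[1:3, c(1, 6)]
```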
Now we make a heatmap using the scaled data.
``` r
# heatmap provides a color image of our data with dendrograms that help identify patterns
heatmap(data_scaled)
```
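One detail worth knowing: `heatmap()` has its own `scale` argument, and its default is `"row"`, so the function standardizes each row again before choosing colors. Because `data_scaled` was already standardized by column, one option (a sketch, not the only reasonable choice) is to turn that extra step off with `scale = "none"` and compare the two pictures. The object name `hm` is made up.

```r
data_scaled <- scale(as.matrix(mtcars))

# Default behavior: heatmap() additionally standardizes each row
heatmap(data_scaled)

# Show the column-scaled values as they are, with no extra row scaling
hm <- heatmap(data_scaled, scale = "none")

# heatmap() invisibly returns the row and column order it used
hm$rowInd
hm$colInd
```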
Answer the following:
Do mpg and wt show similar or opposite patterns?

Write your answers here:
Different color palettes can make patterns easier or harder to see.
We will use the package RColorBrewer, which contains
ready-made color palettes.
``` r
# Packages are loaded with the library() function
library(RColorBrewer)
# Parameters for plotting, so we can see the colors' names
par(cex = 0.5)
# Show available palettes
display.brewer.all()
```
Now choose one palette and use it in a heatmap.
``` r
# Try changing "Greens" to another palette name
heatmap(data_scaled, col = brewer.pal(8, "Greens"))
```
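A palette is just a vector of color names or codes, so the `col` argument accepts any such vector. As a sketch that needs no extra package, base R's `colorRampPalette()` can build a smooth gradient between colors you pick. The two colors and the count of 50 are arbitrary choices for illustration, and `my_palette` is a made-up name.

```r
data_scaled <- scale(as.matrix(mtcars))

# colorRampPalette() returns a function; calling it with 50 gives 50 colors
my_palette <- colorRampPalette(c("white", "darkgreen"))(50)

# Peek at the first few color codes
head(my_palette)

# Use the custom palette in the heatmap
heatmap(data_scaled, col = my_palette)
```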
Choose a different color palette and make a new heatmap.
``` r
# Replace "Purples" with a palette you choose
heatmap(data_scaled, col = brewer.pal(8, "Purples"))
```
Which palette did you choose, and why?
Sometimes we care less about individual rows and more about how variables relate to one another.
A correlation matrix measures how strongly pairs of features move together.
``` r
# Create an object to hold the correlation values
cor_mtcars <- cor(mtcars)
# Round the values to two decimal places
round(cor_mtcars, 2)
```
```
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
```
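To get a feel for what these numbers mean, here is a small sketch. Correlation values run from -1 to 1. The vectors `a`, `b`, and `d` are made-up toy data, not part of mtcars.

```r
# Toy data (made up for illustration)
a <- 1:10
b <- 2 * a + 3   # rises whenever a rises
d <- -a          # falls whenever a rises

cor(a, b)  # perfect positive relationship
cor(a, d)  # perfect negative relationship

# One pair from mtcars with almost no linear relationship
cor(mtcars$drat, mtcars$qsec)

# The same number can be looked up in the correlation matrix by name
cor_mtcars <- cor(mtcars)
cor_mtcars["drat", "qsec"]
```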
We can visualize this correlation matrix as a heatmap too.
``` r
# heatmap works with any numeric matrix
heatmap(cor_mtcars, symm = TRUE)
```
Find one pair of features with a strong positive relationship and one pair with a strong negative relationship.
Write your answer here:
Today we used cars because the dataset is small and easy to practice with.
But the same workflow can be used for biological data:
patients or samples → rows
genes or features → columns
expression values → numbers inside the matrix
heatmap → visual pattern discovery
clustering → grouping similar samples or genes
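The mapping above can be sketched with a toy matrix. Everything in this example is simulated with random numbers: the sample names, gene names, group sizes, and the size of the shift are invented for illustration and do not come from any real patient data.

```r
set.seed(1)  # make the random numbers repeatable

# Simulated matrix: 12 samples (rows) x 20 genes (columns)
expr <- matrix(rnorm(12 * 20), nrow = 12, ncol = 20)
rownames(expr) <- paste0("sample_", 1:12)
colnames(expr) <- paste0("gene_", 1:20)

# Pretend samples 7-12 have higher expression of genes 1-10
expr[7:12, 1:10] <- expr[7:12, 1:10] + 3

# Same workflow as with mtcars: scale the columns, then draw the heatmap
# (scale = "none" because the columns are already scaled)
expr_scaled <- scale(expr)
heatmap(expr_scaled, scale = "none")

# Group the samples into 2 clusters and see who lands together
sample_groups <- cutree(hclust(dist(expr_scaled)), k = 2)
sample_groups
```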
In breast cancer gene expression data, heatmaps can help reveal groups of patients whose tumors have similar molecular patterns. These patterns can sometimes correspond to clinically meaningful tumor subtypes.
This is one example of how coding helps us see structure in complex biological systems.
Answer these questions in a few sentences.
Write your reflection here:
Click the Knit button and choose Knit to HTML.
Your final report should include:
In this activity, you practiced:
- using the assignment operator <-

The most important idea is not the exact syntax. The most important idea is this:
Computational tools can help us discover patterns that are hidden inside large biological datasets.