.Rmd stands for R Markdown file.
Under the File tab, use Save As… to make a version of this file with a new name. In case things go sideways, we can go back to the original.
At the top of this document, put your name between the quotes after author. This is now your notebook.
R is a free, open-source programming language widely used in academia and industry for data analysis, statistical modeling, and data visualization. Throughout the DREAM-High Program, we will be coding with R.
RStudio is a free and open-source environment for the R programming language. It provides a user-friendly interface that makes it easier to work with R and generate reports. We are working in RStudio now!
An R Markdown file, with the .Rmd extension, is a plain text document that combines text formatted in Markdown syntax with code written in R and other languages. Click the “Insert” tab at the top right of this window to see what kinds of programming languages can be used.
With R Markdown, we can generate reproducible and generalizable workflows. We will create beautiful reports we can share: Our R Markdown files can be “knit” (or rendered) into HTML pages, PDFs, Word documents, or slides.
x <- 5
We won’t see the effects of the syntax until we create our report with the “Knit” button at the top of this window.
Heatmaps are a way to colorize, visualize, and organize a data set with the goal of finding relationships among observations and features.
We will use heatmaps in this course to find patterns in the gene expression data for the 1K breast cancer patients from The Cancer Genome Atlas. Here, we will learn how to create heatmaps with a practice data set.
R provides many data sets to work with, so we can learn new analysis skills before scaling up.
mtcars is a classic go-to R data frame. It was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 design and performance features for 32 automobiles (1973–74 models).
Jurui's notes: gray areas are code chunks; white areas are notes.
# This is a comment line
# Functions in R take arguments within the parentheses
# The function head() returns the first few lines of the mtcars table
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
We can open the entire data set as a table in a new tab with the View() function.
# Check out the full data set
View(mtcars)
Each row of mtcars is an automobile, and each column is a performance feature. For example:
- mpg is miles per gallon
- wt is weight
The function help() provides information on R functions and data. We can find out what all the performance features are:
# what exactly is in mtcars?
help(mtcars)
#contains info about units and provides notes
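Alongside help(), a few base R functions give a quick overview of a data set without leaving the notebook. The sketch below is an addition to this activity rather than part of the original steps; ?, dim(), names(), and str() are all standard base R.

```r
# Shorthand for help(mtcars): a question mark before the name
?mtcars

# Number of rows (cars) and columns (features)
dim(mtcars)

# Column names, i.e. the features
names(mtcars)

# Compact summary: the type and first few values of each column
str(mtcars)
```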
The function heatmap() is an easy way to convert the values in mtcars to colors, which helps us visualize the data and look for relationships.
Let’s check out the help page for heatmap().
# The help() function can take a function as its argument
help(heatmap)
In the help file, we learn that heatmap() plots a numeric matrix of values. So our first step will be to convert the data from a table, or data frame, to a matrix of numeric values. We will do most of our analysis on data in matrix form.
The symbol <- is the assignment operator. It assigns the value on the right side of the operator to the variable on the left side. It functions, for us, like an equals (=) sign.
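A small sketch of assignment in action. The variable names x, y, and first_three are made up for illustration. It also shows one place where = plays a different role: inside a function call, it names an argument instead of creating a variable.

```r
# Assign the value 5 to the variable x with the arrow operator
x <- 5

# At the top level of a script, the equals sign does the same job
y = 5

# Both variables now hold the same value
identical(x, y)

# Inside a function call, = names an argument (here n, the number of rows)
first_three <- head(mtcars, n = 3)
nrow(first_three)
```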
# Convert mtcars into a matrix of numbers
# Assign the output to the variable data
data <- as.matrix(mtcars)
#can also use an equals sign = instead of <-
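To confirm the conversion worked, we can ask R what kind of object we have before and after. This is an optional check added for illustration; class(), is.matrix(), is.numeric(), and dim() are base R functions.

```r
# mtcars starts out as a data frame
class(mtcars)

# Convert to a matrix, as in the step above
data <- as.matrix(mtcars)

# Is it a matrix now, and does it hold only numbers?
is.matrix(data)
is.numeric(data)

# The shape is unchanged: 32 cars by 11 features
dim(data)
```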
The heatmap() function is powerful: It not only converts our data values to colors, it also rearranges the rows (automobiles) and columns (performance features) so we can more easily find patterns in the data.
# A heat map is a color image of our data with dendrograms
heatmap(data)
The rows correspond to cars (observations) and the columns to the 11 features (fuel consumption plus the 10 design and performance features).
The dendrograms (or tree diagrams) show how close the cars and features are according to the values in our data set.
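To see where a dendrogram comes from, we can build one ourselves. According to the heatmap() help page, the defaults for measuring distance and clustering are dist() and hclust(), so the sketch below uses those two functions directly. The variable names car_distances, car_tree, and feature_tree are made up for illustration, and heatmap() also reorders its trees, so the leaf order may not match the plot exactly.

```r
# Same conversion as in the notebook
data <- as.matrix(mtcars)

# Pairwise Euclidean distances between cars (rows)
car_distances <- dist(data)

# Hierarchical clustering; complete linkage is the hclust() default
car_tree <- hclust(car_distances)

# Draw the tree of cars on its own
plot(car_tree, cex = 0.6)

# For the feature tree, transpose so that features become rows
feature_tree <- hclust(dist(t(data)))
plot(feature_tree)
```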
In the default coloring scheme, the highest values have the darkest colors. We can see that some features, such as disp and hp, have higher values than others, but otherwise the visualization is not helpful.
Look at the mtcars table. Different features have very different scales, so a value that is high (red) for one feature, e.g. cyl, is low for another feature, e.g. disp.
The scale() function normalizes the features so they are comparable. By default, it centers each column at a mean of 0 and rescales it to a standard deviation of 1.
# Let's change the range of each feature so they are comparable
# We'll assign the output to a new variable data_scaled
data_scaled <- scale(data)
#first few rows of the scaled data
head(data_scaled)
##                          mpg        cyl        disp         hp       drat
## Mazda RX4          0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
## Mazda RX4 Wag      0.1508848 -0.1049878 -0.57061982 -0.5350928  0.5675137
## Datsun 710         0.4495434 -1.2248578 -0.99018209 -0.7830405  0.4739996
## Hornet 4 Drive     0.2172534 -0.1049878  0.22009369 -0.5350928 -0.9661175
## Hornet Sportabout -0.2307345  1.0148821  1.04308123  0.4129422 -0.8351978
## Valiant           -0.3302874 -0.1049878 -0.04616698 -0.6080186 -1.5646078
##                             wt       qsec         vs         am       gear
## Mazda RX4         -0.610399567 -0.7771651 -0.8680278  1.1899014  0.4235542
## Mazda RX4 Wag     -0.349785269 -0.4637808 -0.8680278  1.1899014  0.4235542
## Datsun 710        -0.917004624  0.4260068  1.1160357  1.1899014  0.4235542
## Hornet 4 Drive    -0.002299538  0.8904872  1.1160357 -0.8141431 -0.9318192
## Hornet Sportabout  0.227654255 -0.4637808 -0.8680278 -0.8141431 -0.9318192
## Valiant            0.248094592  1.3269868  1.1160357 -0.8141431 -0.9318192
##                         carb
## Mazda RX4          0.7352031
## Mazda RX4 Wag      0.7352031
## Datsun 710        -1.1221521
## Hornet 4 Drive    -1.1221521
## Hornet Sportabout -0.5030337
## Valiant           -1.1221521
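What does scale() actually do to each column? With its default settings, it subtracts the column mean and divides by the column standard deviation. The sketch below repeats that calculation by hand for one feature and compares the result; mpg_by_hand is a name made up for illustration.

```r
# Same steps as in the notebook
data <- as.matrix(mtcars)
data_scaled <- scale(data)

# Manual version for the mpg column: subtract the mean, divide by the sd
mpg_by_hand <- (data[, "mpg"] - mean(data[, "mpg"])) / sd(data[, "mpg"])

# Compare with the column produced by scale()
all.equal(mpg_by_hand, data_scaled[, "mpg"])

# After scaling, every column has mean (about) 0 and standard deviation 1
round(colMeans(data_scaled), 10)
apply(data_scaled, 2, sd)
```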
Let’s see if a heatmap for the scaled data is more informative.
# A heat map is a color image of our data with dendrograms
heatmap(data_scaled)
Now the patterns emerge!
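One detail worth knowing: according to its help page, heatmap() has its own scale argument, which defaults to "row" for a matrix like ours, so the colors are based on values rescaled once more within each row. The optional sketch below, an addition to this activity, switches that off so the colors reflect our column-scaled values directly. The variable name hm is made up for illustration.

```r
# Same scaled matrix as in the notebook
data_scaled <- scale(as.matrix(mtcars))

# Turn off heatmap()'s own rescaling; colors now follow data_scaled as-is
hm <- heatmap(data_scaled, scale = "none")

# heatmap() invisibly returns the row and column order used in the plot
hm$rowInd
hm$colInd
```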
What relationships do you find? Are the values and groupings for wt and mpg surprising? Does the clustering of vehicles make sense?
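One way to check an impression from the heatmap is to compute a correlation, which is a number between -1 and 1 that summarizes how two features move together. This sketch is an addition to the activity; cor() is a base R function and r_wt_mpg is a name made up for illustration.

```r
# Correlation between weight and miles per gallon
# A negative value means heavier cars tend to get fewer miles per gallon
r_wt_mpg <- cor(mtcars$wt, mtcars$mpg)
r_wt_mpg

# Correlations between every pair of features, rounded for readability
round(cor(mtcars), 2)
```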
We can use a color palette to change the color coding and style of our heatmap.
RColorBrewer is an R package that contains ready-to-use color palettes for creating nice graphics. The palettes come from ColorBrewer, a set of color schemes designed by cartographer Cynthia Brewer.
# Packages are loaded with the library() function
library(RColorBrewer)
# Parameters for plotting
par(cex = 0.5)
# Get a graphic for all color schemes
display.brewer.all()
The default color coding used by heatmap() is “YlOrRd”, which is the top row.
We can use any of the palettes provided. Perhaps another scheme reveals relationships in the data more effectively or it is just more fun.
# Change the argument in parentheses to any of the palettes
heatmap(data_scaled, col=brewer.pal(8,"RdPu"))
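A palette is just a list of colors, so we can look at it and stretch it. The sketch below is an optional extension: brewer.pal() comes from RColorBrewer, colorRampPalette() is part of base R, and the names pink_purple and smooth_palette are made up for illustration. The choice of 50 shades is arbitrary.

```r
library(RColorBrewer)

# Same scaled matrix as in the notebook
data_scaled <- scale(as.matrix(mtcars))

# brewer.pal() returns a character vector of hex color codes
pink_purple <- brewer.pal(8, "RdPu")
pink_purple

# colorRampPalette() blends between those colors to make a smoother gradient
smooth_palette <- colorRampPalette(pink_purple)(50)

# Use the 50-shade version in the heatmap
heatmap(data_scaled, col = smooth_palette)
```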
Click on the Knit icon next to the ball of blue yarn and select Knit to HTML. This will create an HTML file of your report. We will publish our reports online so we can share what we’ve done with others.
Great work! We learned a lot about R, including the functions help(), head(), and heatmap().
We will get lots of practice with all the functionality in our activities. We will use heatmaps and dendrograms to look for patterns in breast cancer gene expression data.