Reshaping Data Frames

Prologue

  • We may occasionally come across unusually tall or wide data frames.
  • These data frames normally comprise of subjects whose measurements (values) were taken for different conditions. These conditions normally share the same theme, for example, different diets, gene expression etc.
  • In this session, we’ll go through 2 functions from the reshape2 package to reshape these unusually tall or wide data frames.
  • We’ll use the ChickWeight dataset from base R and then another example using cancer dataset published here.
library(reshape2)

Tall-to-wide

  • Let’s have a look at the ChickWeight dataset.
head(ChickWeight)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
tail(ChickWeight)
##     weight Time Chick Diet
## 573    155   12    50    4
## 574    175   14    50    4
## 575    205   16    50    4
## 576    234   18    50    4
## 577    264   20    50    4
## 578    264   21    50    4
  • These cohort of chicks were fed with different diets (1-4) across different time intervals with their weights taken for each time interval.
  • What if we would like to visualise more compactly the weight across different time intervals for each of the diets?
  • dcast() function does just that. It converts a tall data frame to a wide one.
  • The value.var argument refers to the variable with the measured values while the formula argument refers to the grouping of these measured values.
wide <- dcast(data=ChickWeight, formula=Diet + Chick ~ Time, value.var="weight")
dim(wide); head(wide)
## [1] 50 14
##   Diet Chick  0  2  4  6  8 10 12 14 16  18  20  21
## 1    1    18 39 35 NA NA NA NA NA NA NA  NA  NA  NA
## 2    1    16 41 45 49 51 57 51 54 NA NA  NA  NA  NA
## 3    1    15 41 49 56 64 68 68 67 68 NA  NA  NA  NA
## 4    1    13 41 48 53 60 65 67 71 70 71  81  91  96
## 5    1     9 42 51 59 68 85 96 90 92 93 100 100  98
## 6    1    20 41 47 54 58 65 73 77 89 98 107 115 117

Wide-to-tall

  • Let’s reshape the previously dcast-ed data frame back to it’s original tall version
  • The measure.var argument refers to the variable with the measured values. This normally comprise of the stretch of columns after the grouping variable(s). The id.vars argument refers to the grouping of these measured values.
tall <- melt(data=wide, id.vars=c("Diet", "Chick"), measure.vars=names(wide)[-c(1:2)])

# Rename columm names
names(tall)[3] <- "Time"
names(tall)[4] <- "weight"

dim(tall); head(tall)
## [1] 600   4
##   Diet Chick Time weight
## 1    1    18    0     39
## 2    1    16    0     41
## 3    1    15    0     41
## 4    1    13    0     41
## 5    1     9    0     42
## 6    1    20    0     41

Another example

  • We’ll use another dataset to further illustrate the concept of reshaping data frames.
  • Download and read data frame.
# Download file
if(!file.exists("Matrix_Biallelic_Monoallelic_Pathogenic_VUS_All_cancers_Mutation_Types_Paper.txt")) {
download.file(url="https://raw.githubusercontent.com/riazn/biallelic_hr/57081bbcdf4ed7102c766e7d1792e1780a02f2e2/Supplementary_Files/Matrix_Biallelic_Monoallelic_Pathogenic_VUS_All_cancers_Mutation_Types_Paper.txt", destfile="Matrix_Biallelic_Monoallelic_Pathogenic_VUS_All_cancers_Mutation_Types_Paper.txt", method="curl")
}

# Read file
wide <- read.table("Matrix_Biallelic_Monoallelic_Pathogenic_VUS_All_cancers_Mutation_Types_Paper.txt", header=TRUE, sep="\t", stringsAsFactors=FALSE)

# Create Gene column and remove rownames
wide <- data.frame(Gene=row.names(wide), wide)

row.names(wide) <- NULL

# Check dimensions
dim(wide)
## [1]  102 8179
# Sneak peek
wide[1:5,1:5]
##      Gene TCGA.OR.A5LD TCGA.OR.A5JI TCGA.OR.A5JL TCGA.OR.A5K6
## 1   BRCA1            0            0            0            0
## 2   BRCA2            0            0            0            0
## 3    EXO1            0            0            0            0
## 4 FAM175A            0            0            0            0
## 5  MRE11A            0            0            0            0
  • The gene status are given for each sample (represented by the prefix “TCGA”).
  • Here, we would like to transpose the data frame so the samples will be row-wise. This is because, as you will see later, we would like to annotate these samples with their cancer type. And this process of annotating the samples will be much easier with the samples as rows.
tall <- melt(data=wide, id.vars="Gene", measure.vars=names(wide)[-1])

# Rename columns
names(tall)[2] <- "Sample"
names(tall)[3] <- "Gene_Status"

# Tidy Sample column
tall$Sample <- gsub("\\.", "-", tall$Sample)

# Check dimensions
dim(tall)
## [1] 834156      3
# Sneak peek
head(tall)
##      Gene       Sample Gene_Status
## 1   BRCA1 TCGA-OR-A5LD           0
## 2   BRCA2 TCGA-OR-A5LD           0
## 3    EXO1 TCGA-OR-A5LD           0
## 4 FAM175A TCGA-OR-A5LD           0
## 5  MRE11A TCGA-OR-A5LD           0
## 6     NBN TCGA-OR-A5LD           0
  • Download and read the file with sample names that correspond to breast cancer.
# Download file
if(!file.exists("BRCA_Tumor_samples_Looked_at.csv")) {
download.file(url="https://raw.githubusercontent.com/riazn/biallelic_hr/57081bbcdf4ed7102c766e7d1792e1780a02f2e2/Supplementary_Files/samples_looked_at/BRCA_Tumor_samples_Looked_at.csv", destfile="BRCA_Tumor_samples_Looked_at.csv", method="curl")
}

# Read file
breast <- read.csv("BRCA_Tumor_samples_Looked_at.csv", sep="\t", stringsAsFactors=FALSE)
  • Retrieve the breast cancer samples from the original data frame (tall) using the merge() function.
  • More on merging data frames here.
breast_only <- merge(tall, breast, by.x="Sample", by.y="ids")

# Check dimensions
dim(breast_only)
## [1] 94758     4
# Sneak peek
head(breast_only)
##         Sample   Gene Gene_Status cancer
## 1 TCGA-A1-A0SB  BRCA1           0   BRCA
## 2 TCGA-A1-A0SB  XRCC2           0   BRCA
## 3 TCGA-A1-A0SB  SLX1B           0   BRCA
## 4 TCGA-A1-A0SB  UIMC1           0   BRCA
## 5 TCGA-A1-A0SB RNF168           0   BRCA
## 6 TCGA-A1-A0SB  XRCC3           0   BRCA
  • As you can see, the retrieval of a subset of cancer type (in this case cancer type that corresponds to breast) is much feasible when the samples are row-wise, and that is made possible after “melting” the data frame.

Summary

  • Use dcast() to reshape data frames from tall to wide.
  • Use melt() to reshape data frames from wide to tall.