Tall-to-wide
- Let’s have a look at the ChickWeight dataset.
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
tail(ChickWeight)
## weight Time Chick Diet
## 573 155 12 50 4
## 574 175 14 50 4
## 575 205 16 50 4
## 576 234 18 50 4
## 577 264 20 50 4
## 578 264 21 50 4
- These cohort of chicks were fed with different diets (1-4) across different time intervals with their weights taken for each time interval.
- What if we would like to visualise more compactly the weight across different time intervals for each of the diets?
- dcast() function does just that. It converts a tall data frame to a wide one.
- The value.var argument refers to the variable with the measured values while the formula argument refers to the grouping of these measured values.
wide <- dcast(data=ChickWeight, formula=Diet + Chick ~ Time, value.var="weight")
dim(wide); head(wide)
## [1] 50 14
## Diet Chick 0 2 4 6 8 10 12 14 16 18 20 21
## 1 1 18 39 35 NA NA NA NA NA NA NA NA NA NA
## 2 1 16 41 45 49 51 57 51 54 NA NA NA NA NA
## 3 1 15 41 49 56 64 68 68 67 68 NA NA NA NA
## 4 1 13 41 48 53 60 65 67 71 70 71 81 91 96
## 5 1 9 42 51 59 68 85 96 90 92 93 100 100 98
## 6 1 20 41 47 54 58 65 73 77 89 98 107 115 117
Wide-to-tall
- Let’s reshape the previously dcast-ed data frame back to it’s original tall version
- The measure.var argument refers to the variable with the measured values. This normally comprise of the stretch of columns after the grouping variable(s). The id.vars argument refers to the grouping of these measured values.
tall <- melt(data=wide, id.vars=c("Diet", "Chick"), measure.vars=names(wide)[-c(1:2)])
# Rename columm names
names(tall)[3] <- "Time"
names(tall)[4] <- "weight"
dim(tall); head(tall)
## [1] 600 4
## Diet Chick Time weight
## 1 1 18 0 39
## 2 1 16 0 41
## 3 1 15 0 41
## 4 1 13 0 41
## 5 1 9 0 42
## 6 1 20 0 41
Another example
- We’ll use another dataset to further illustrate the concept of reshaping data frames.
- Download and read data frame.
# Download file
if(!file.exists("Matrix_Biallelic_Monoallelic_Pathogenic_VUS_All_cancers_Mutation_Types_Paper.txt")) {
download.file(url="https://raw.githubusercontent.com/riazn/biallelic_hr/57081bbcdf4ed7102c766e7d1792e1780a02f2e2/Supplementary_Files/Matrix_Biallelic_Monoallelic_Pathogenic_VUS_All_cancers_Mutation_Types_Paper.txt", destfile="Matrix_Biallelic_Monoallelic_Pathogenic_VUS_All_cancers_Mutation_Types_Paper.txt", method="curl")
}
# Read file
wide <- read.table("Matrix_Biallelic_Monoallelic_Pathogenic_VUS_All_cancers_Mutation_Types_Paper.txt", header=TRUE, sep="\t", stringsAsFactors=FALSE)
# Create Gene column and remove rownames
wide <- data.frame(Gene=row.names(wide), wide)
row.names(wide) <- NULL
# Check dimensions
dim(wide)
## [1] 102 8179
# Sneak peek
wide[1:5,1:5]
## Gene TCGA.OR.A5LD TCGA.OR.A5JI TCGA.OR.A5JL TCGA.OR.A5K6
## 1 BRCA1 0 0 0 0
## 2 BRCA2 0 0 0 0
## 3 EXO1 0 0 0 0
## 4 FAM175A 0 0 0 0
## 5 MRE11A 0 0 0 0
- The gene status are given for each sample (represented by the prefix “TCGA”).
- Here, we would like to transpose the data frame so the samples will be row-wise. This is because, as you will see later, we would like to annotate these samples with their cancer type. And this process of annotating the samples will be much easier with the samples as rows.
tall <- melt(data=wide, id.vars="Gene", measure.vars=names(wide)[-1])
# Rename columns
names(tall)[2] <- "Sample"
names(tall)[3] <- "Gene_Status"
# Tidy Sample column
tall$Sample <- gsub("\\.", "-", tall$Sample)
# Check dimensions
dim(tall)
## [1] 834156 3
# Sneak peek
head(tall)
## Gene Sample Gene_Status
## 1 BRCA1 TCGA-OR-A5LD 0
## 2 BRCA2 TCGA-OR-A5LD 0
## 3 EXO1 TCGA-OR-A5LD 0
## 4 FAM175A TCGA-OR-A5LD 0
## 5 MRE11A TCGA-OR-A5LD 0
## 6 NBN TCGA-OR-A5LD 0
- Download and read the file with sample names that correspond to breast cancer.
# Download file
if(!file.exists("BRCA_Tumor_samples_Looked_at.csv")) {
download.file(url="https://raw.githubusercontent.com/riazn/biallelic_hr/57081bbcdf4ed7102c766e7d1792e1780a02f2e2/Supplementary_Files/samples_looked_at/BRCA_Tumor_samples_Looked_at.csv", destfile="BRCA_Tumor_samples_Looked_at.csv", method="curl")
}
# Read file
breast <- read.csv("BRCA_Tumor_samples_Looked_at.csv", sep="\t", stringsAsFactors=FALSE)
- Retrieve the breast cancer samples from the original data frame (tall) using the merge() function.
- More on merging data frames here.
breast_only <- merge(tall, breast, by.x="Sample", by.y="ids")
# Check dimensions
dim(breast_only)
## [1] 94758 4
# Sneak peek
head(breast_only)
## Sample Gene Gene_Status cancer
## 1 TCGA-A1-A0SB BRCA1 0 BRCA
## 2 TCGA-A1-A0SB XRCC2 0 BRCA
## 3 TCGA-A1-A0SB SLX1B 0 BRCA
## 4 TCGA-A1-A0SB UIMC1 0 BRCA
## 5 TCGA-A1-A0SB RNF168 0 BRCA
## 6 TCGA-A1-A0SB XRCC3 0 BRCA
- As you can see, the retrieval of a subset of cancer type (in this case cancer type that corresponds to breast) is much feasible when the samples are row-wise, and that is made possible after “melting” the data frame.