Data Wrangling Basics

Author

Brian Gural, Lorrie He, JP Flores

What is data wrangling?

Data wrangling, manipulation, or cleaning is the process of transforming data into a format that is more suitable for analysis. This can include removing missing values, changing the format of data, or combining multiple datasets.
There’s rarely a single way to approach any given data-wrangling problem! Expanding your “toolkit” allows you to tackle problems from different angles.

Tip

Before you begin wrangling data, you should be able to:

Define how you want the data to look and why
Document it well so that others (and future you!) know what you did
Know what tools you have and how to use them

Working with vectors

Pulling out specific parts of a data set is important when analyzing with R. Indexing, or accessing elements, subsets data based on numeric positions in a vector by using brackets. i.e. the 5th element in a vector will be returned if you run vector[5]. It’s also helpful for getting several elements at once, or reordering data. Here are some examples:

# First, we'll make a vector to play with
names <- c("rosalind", "marie", "barbara")

# if we print the output, we'd get:
names

[1] "rosalind" "marie"    "barbara"

# If we want to access the first name, we can use brackets and the position of the name in the vector:

names[1]

[1] "rosalind"

# This works with any position, for example the third name:
names[3]

[1] "barbara"

# You can index more than one position at a time too:
names[c(1,2)]

[1] "rosalind" "marie"

# Changing the order of numbers you supply changes the order of names returned
names[c(2,1)]

[1] "marie"    "rosalind"

Working with data frames

This works for two-dimensional structures too, like data frames and matrices. We’d just format it as: dataframe[row,column]. Let’s try it out:

# Make a data frame
df <- data.frame(name = c("rosalind", "marie", "barbara"),
                 nobel_prize = c("dna", "radium", "transposons"),
                 school = c("cambridge", "columbia", "cornell"))

df

      name nobel_prize    school
1 rosalind         dna cambridge
2    marie      radium  columbia
3  barbara transposons   cornell

# To get the first row:
df[1,]

      name nobel_prize    school
1 rosalind         dna cambridge

# or the first column: 
df[,1]

[1] "rosalind" "marie"    "barbara"

# for specific cells: 
df[2,3]

[1] "columbia"

# We can use the column name instead of numbers:
df[,"name"]

[1] "rosalind" "marie"    "barbara"

# We can do the same thing by using a dollar sign:
df$name

[1] "rosalind" "marie"    "barbara"

# We can also give a list of columns
# which return in the order provided
df[,c("nobel_prize","name")]

  nobel_prize     name
1         dna rosalind
2      radium    marie
3 transposons  barbara

`dplyr` verbs

The dplyr package is a powerful tool for data manipulation. It’s part of the tidyverse package, which is a collection of packages that work together to make data manipulation easier. Before manipulating data, we can take a peek with the follow functions:

dplyr::glimpse(df)

Rows: 3
Columns: 3
$ name        <chr> "rosalind", "marie", "barbara"
$ nobel_prize <chr> "dna", "radium", "transposons"
$ school      <chr> "cambridge", "columbia", "cornell"

With the glimpse function we see that this is a data frame with 3 observations and 3 variables. We can also see the type of each variable and the first few values.

Tip

dplyr functions have a lot in common:

The first argument is always a data frame
The subsequent arguments typicall describe which columns to operate on, using the variable names (without quotes)
The output is always a new data frame

The dplyr package has a set of functions called “verbs” that are used to manipulate data frames. These verbs can perform manipulations on rows of data frames (ex. filter, arrange, and distinct) and columns of data frames (ex. mutate, select, rename, and relocate). This package also has functions for working with groups (ex. group_by, summarize, and the slice family of functions.)

Row Verbs

filter allows you to…

df |> 
  filter(nobel_prize == "dna")

      name nobel_prize    school
1 rosalind         dna cambridge

arrange allows you to…

distinct allows you to…

Column Verbs

mutate allows you to…

select allows you to…

df |> 
  select(name)

      name
1 rosalind
2    marie
3  barbara

rename allows you to…

relocate allows you to…

Group Verbs

group_by allows you to…

df |> 
  filter(nobel_prize == "dna")

      name nobel_prize    school
1 rosalind         dna cambridge

summarize allows you to…

The slice family of functions allows you to…

Ungrouping with `ungroup`

the `.by` argument

Conclusion

Hopefully now, you feel a little more confident about working with vectors, data frames, and using dplyr verbs to clean and manipulate data. Happy Wrangling!

Case study

A lot of people struggle to learn how to code without having an application. Since this course is geared towards biomedical sciences, we thought you might find it easier if we work through an actual research data set. We have some data from an experiment that measured the cellular composition of samples from different treatments and genotypes.

What are some things that we, as researchers, would want to know about our data?

Did the experiment work?

Things to consider:

Your controls or expected features!

Do we see differences between our experimental groups?

To get at these questions, we need to be able to manipulate our data into the formats needed to check those features and for plotting. To start, lets take a look at how the results are structured before we start planning how to do the processing.

Getting familiar with the data

# Load the data. The sample IDs were stored as the first row, so lets make those the row.names
cell_props <- read.csv("03-dataWrangling_files/cellProportions.csv",
                       row.names = 1)

head(cell_props)

            Cardiomyocytes Fibroblast Endothelial.Cells Macrophage
whole_2          0.6516497 0.08856450       0.067004520 0.17605980
fraction_13      0.8243539 0.03702213       0.063869994 0.06917268
fraction_12      0.8954119 0.02126763       0.044361580 0.03895884
fraction_19      0.0000000 0.99832710       0.001672903 0.00000000
fraction_18      0.0000000 1.00000000       0.000000000 0.00000000
whole_16         0.8196176 0.02082482       0.088890150 0.05008898
            Pericytes.SMC
whole_2       0.016721500
fraction_13   0.005581313
fraction_12   0.000000000
fraction_19   0.000000000
fraction_18   0.000000000
whole_16      0.020578470

It looks like our data, cell_props, is formatted with the samples as rows and the different cell types measured as columns. This follows the tidy data style (https://r4ds.had.co.nz/tidy-data.html). Being tidy is an approach to handling data in R that aims to be clear and readable. The universe of associated packages are called the tidyverse, and it’s a hot-topic in the R world. Later, we’ll get into specifics of how it differs from what we’ve been doing so far.

When assessing data, it’s good to consider what features you’d expect from a given data set. This helps you know if something has gone wrong before you’ve gotten your hands on it. For example, raw RNA-seq matrices should have values that go up to 100,000s. So if you only see small numbers in the data provided by a collaborator, it’s likely been manipulated in some way.

In our case, we’re looking at the proportion of cell types in each sample. *Whole refers to ___ and fraction refers to ___*. Proportions, by nature, sum up to 1. In this case, checking that the values in each row sum to 1 is one way to confirm that we have what we’re expecting. Lets go ahead and check that:

rowSums(cell_props)

    whole_2 fraction_13 fraction_12 fraction_19 fraction_18    whole_16 
          1           1           1           1           1           1 
   whole_15    whole_12 fraction_14     whole_8    whole_14     whole_3 
          1           1           1           1           1           1 
 fraction_9 fraction_11     whole_6 fraction_16  fraction_8     whole_4 
          1           1           1           1           1           1 
 fraction_5  fraction_1    whole_10 fraction_10     whole_7     whole_5 
          1           1           1           1           1           1 
    whole_9  fraction_6  fraction_3 fraction_17     whole_1    whole_13 
          1           1           1           1           1           1 
 fraction_7  fraction_4    whole_11 fraction_20  fraction_2 fraction_15 
          1           1           1           1           1           1

This looks good! There are endless factors you could check about your data. It’s a good practice to consider what they could be and check them every so often throughout an analysis.

We also have the phenotypes for the samples in a separate csv file:

cell_phenos <- read.csv("03-dataWrangling_files/cellPhenotypes.csv",
                        row.names = 1)

str(cell_phenos)

'data.frame':   36 obs. of  3 variables:
 $ type     : chr  "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" ...
 $ genotype : chr  NA NA NA NA ...
 $ treatment: chr  NA NA NA NA ...

This tells us that cell_phenos is a data frame with 3 variable (columns) and 36 observations (rows). It has info about the type of sample, genotypes, and treatment status.

Planning the analysis

Let’s recount what we want to know and then what we already know about the data:

We know that the phenotype information and cell proportion measures are each their own data frames, and they seem to have corresponding row names.

We want to know:

If the controls look as we’d expect
What group differences we have

To get at the question about controls, we’d need to check cell_phenos to see which samples are from the control or experimental groups. After, we’ll plot the proportions.

flowchart LR
A[Dataframe of cell proportions] --> C{Merged proportions\n and phenotypes}
B[Dataframe of sample phenotypes]  --> C
C --> D(Steps to get data\nformatted for plotting)
D --> E[Plot of control samples]
D --> F[Plot of experiment samples]

Manipulating data frames

Summarizing and subsetting

The row names in our data aren’t very informative. Luckily, our collaborator gave us a phenotype table too. Lets load that in and get a sense of how it looks:

Let’s get more context on what’s in the data. table is a convenient way to summarize columns and lists:

# What unique values and how many of each are in the "genotype" field
table(cell_phenos$genotype)


cmAKO    WT 
    8     8

…wait, why didn’t we see any of the NAs we saw when we ran str(cell_phenos)? Turns out that table() wants you to specify that you’re interested in NAs with useNA = "ifany". Remember that ?table returns the manual for table() and works for any function!

# What unique values and how many of each are in the "type" field
table(cell_phenos$genotype, useNA = "ifany")


cmAKO    WT  <NA> 
    8     8    20

Of the samples, 8 are knock-outs, 8 are wild type, and 20 haven’t been labeled with either. So, like any good bioinformatician, you email your collaborator and ask what’s up with that. Apparently, they included some control samples. These are experimentally pure cell type fractions, so they’d expect them to mostly just include a single cell type. Because of that, the genotype and treatment factors don’t apply to them.

With that in mind, lets subset the data into the whole and fraction samples. While you could go and manually index the data, the aptly named subset() function is great for this. We can also use is.na() to pinpoint the fraction samples, since they won’t have genotype information.

# Pull out the rows with missing genotypes
fraction_phenos <- subset(cell_phenos, is.na(genotype))
head(fraction_phenos)

                              type genotype treatment
fraction_1 purified_cardiomyocytes     <NA>      <NA>
fraction_2 purified_cardiomyocytes     <NA>      <NA>
fraction_3 purified_cardiomyocytes     <NA>      <NA>
fraction_4 purified_cardiomyocytes     <NA>      <NA>
fraction_5 purified_cardiomyocytes     <NA>      <NA>
fraction_6    purified_fibroblasts     <NA>      <NA>

# Using an exclamation point inverses true/false values in R. 
# Think of it as saying,"is NOT NA"
whole_phenos <- subset(cell_phenos, !is.na(genotype))
head(whole_phenos)

                type genotype treatment
whole_1 whole_tissue       WT        MI
whole_2 whole_tissue       WT        MI
whole_3 whole_tissue       WT        MI
whole_4 whole_tissue       WT        MI
whole_5 whole_tissue       WT      Sham
whole_6 whole_tissue       WT      Sham

Combining and reordering

It’s common to join two data sets together while exploring data. In our case, we’re going to combine the phenotype and composition data sets, which helps later when we want to make some more complex plots.

Data frames can be merged in a bunch of ways, but no matter the method it’s essential that the order of samples match. R has two built-in methods, binds (cbind and rbind) and merge.

Binds slap two data frames together. cbind adds the second data frame new columns, rbind adds rows. Binds don’t consider the order of the data sets, so there’s a risk of things being out of order.

merge is similar to cbind, but matches the data sets based on a common column. In our data, we would merge proportions with the phenotypes by the sample id.

Lets outline how we would go about each of these. For cbind we’ll check the order of the data by indexing first. With merge, we’ll tell the function what we want to merge by:

cbind

# Check if the sample names match 
rownames(cell_phenos) == rownames(cell_props)

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# They don't, so lets reorder one to match the other
# This uses the cell_phenos rownames as a list to specify the order of indices 
cell_props <- cell_props[rownames(cell_phenos),]

# They should all be TRUE now
rownames(cell_phenos) == rownames(cell_props)

 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE

# Now we can merge them 
data_bind <- cbind(cell_phenos, cell_props)

head(data_bind)

                              type genotype treatment Cardiomyocytes Fibroblast
fraction_1 purified_cardiomyocytes     <NA>      <NA>      0.9114677 0.01599710
fraction_2 purified_cardiomyocytes     <NA>      <NA>      0.9414552 0.01181459
fraction_3 purified_cardiomyocytes     <NA>      <NA>      0.8978371 0.01648744
fraction_4 purified_cardiomyocytes     <NA>      <NA>      0.8689154 0.05075985
fraction_5 purified_cardiomyocytes     <NA>      <NA>      0.9457260 0.01287280
fraction_6    purified_fibroblasts     <NA>      <NA>      0.0000000 0.89757600
           Endothelial.Cells Macrophage Pericytes.SMC
fraction_1        0.03976535 0.03276989   0.000000000
fraction_2        0.02656110 0.02016908   0.000000000
fraction_3        0.04457489 0.03390135   0.007199195
fraction_4        0.02353460 0.05244883   0.004341316
fraction_5        0.02151797 0.01988320   0.000000000
fraction_6        0.01375279 0.01838331   0.070287900

merge

# Specify row.names as the feature to merge by
data_merge <- merge(cell_phenos, cell_props, by = "row.names")

head(data_merge)

    Row.names                    type genotype treatment Cardiomyocytes
1  fraction_1 purified_cardiomyocytes     <NA>      <NA>      0.9114677
2 fraction_10    purified_fibroblasts     <NA>      <NA>      0.0000000
3 fraction_11 purified_cardiomyocytes     <NA>      <NA>      0.7717289
4 fraction_12 purified_cardiomyocytes     <NA>      <NA>      0.8954119
5 fraction_13 purified_cardiomyocytes     <NA>      <NA>      0.8243539
6 fraction_14 purified_cardiomyocytes     <NA>      <NA>      0.9276313
   Fibroblast Endothelial.Cells  Macrophage Pericytes.SMC
1 0.015997100       0.039765350 0.032769890  0.0000000000
2 0.993854164       0.005004414 0.001141422  0.0000000000
3 0.062611740       0.059881996 0.098142516  0.0076348910
4 0.021267630       0.044361580 0.038958840  0.0000000000
5 0.037022134       0.063869994 0.069172675  0.0055813130
6 0.008614593       0.041595527 0.021979601  0.0001790243

And we’re done. While this won’t always be the case with merge vs. bind, it might be better to use merge in this scenario. It’s a good practice to make sure your scripts are interpretable and merge requires minimal preprocessing. If you continue with programming, there will be points when you need to share your code with some or return to code you wrote months to years ago. When that happens, you’ll appreciate leaving easy-to-understand scripts!

Preparing for different visualizations

Our data is all in one place, now we just need to think of how we want to visualize it. You should already have a bit of background with ggplot2, so lets go ahead and generate some basic plots.

library(tidyverse)

ggplot(data = data_merge, aes(x = Cardiomyocytes, y = Fibroblast, color = type))+
  geom_point()

ggplot(data = data_merge, aes(x = Cardiomyocytes, fill = type))+
  geom_density(alpha = 0.5) +
  theme_minimal()

This is a nice, simple plot, but it’s missing a lot of information we’d want to know about our data. We’re probably more interested in looking at all of the cell types at once. At this point, we should ask ourselves a few questions:

What am I trying to see about the data?

Our samples have data on the proportions of many cell types. I’d want to easily compare all of these cell types at once, with samples/groups side-by-side.

What kind of plot do we want?

Pie charts are often used to visualize percents/proportions, but its difficult to see differences between two pie charts. A stacked bar plot would be a better fit, since we’re trying to compare different sample groups.

What format does my data need to be to make said plot?

This stacked bar plot would have:

Samples on the X-axis
Cell-type proportions on the Y-axis
Colors for each cell type in each bar

For ggplot to make this, our data needs to have a column specifically for each of those terms. The problem is, the data is spread accross many columns. To understand how to solve this, we first need to understand the concepts of wide and long data.

Pivoting wide and long

Data with many observations and samples is often formatted in one of two way: wide or long. Our current data is in a wide format. Wide format data is defined by having a single row for each sample and a column for each observation.

When wide data is pivoted into a long format data it condenses several columns together without losing any information. For most people, it’s easiest to understand how pivoting works by seeing it done.

Source: Garrick Aden-Buie’s (@grrrck) Tidy Animated Verbs modified by Mara Averick (@dataandme)

Back to the task at hand. We want to make a single column with values of cell type proportions and another with names of cell types. When we pivot longer, we need to indicate which columns we want to consolidate. In our case, this will be all the cell types.

data_long <- pivot_longer(data_merge, cols = c(Cardiomyocytes, Fibroblast, Endothelial.Cells, Macrophage, Pericytes.SMC), names_to = "cell.type", values_to = "proportion")

str(data_long)

tibble [180 × 6] (S3: tbl_df/tbl/data.frame)
 $ Row.names : 'AsIs' chr [1:180] "fraction_1" "fraction_1" "fraction_1" "fraction_1" ...
 $ type      : chr [1:180] "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" ...
 $ genotype  : chr [1:180] NA NA NA NA ...
 $ treatment : chr [1:180] NA NA NA NA ...
 $ cell.type : chr [1:180] "Cardiomyocytes" "Fibroblast" "Endothelial.Cells" "Macrophage" ...
 $ proportion: num [1:180] 0.9115 0.016 0.0398 0.0328 0 ...

We have a couple of changes:

There are two new columns, cell.type and proportion
We have A LOT more rows than we did originally
Our data has been coerced from a data frame to a tibble
The sample IDs were coerced to a column “Row.names” that is an ‘AsIs’ character

The first two are expected, but why do we have a tibble and what even is that? A tibble is the tidy alternative to a data frame. pivot_longer is from the tidyr package, which is part of the tidyverse. ADD STUFF ABOUT TIBBLES?

Pipes

Regarding the new column, lets replace the new column with a sample.id column with character type data in it. I’ll use multiple functions in succession to make these changes. Instead of making several lines of code, I could simplify the process by piping the functions into each other. To illustrate, I’ll show you how we would do this first without then with piping:

# mutate is a tidy function from dplyr that helps make new columns based on formulas you give it
data_example <- mutate(.data = data_long, sample.id = as.character(Row.names))

# Select is another tidy function that subsets your data by column name 
data_example <- select(.data = data_example, -Row.names)

str(data_example)

tibble [180 × 6] (S3: tbl_df/tbl/data.frame)
 $ type      : chr [1:180] "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" ...
 $ genotype  : chr [1:180] NA NA NA NA ...
 $ treatment : chr [1:180] NA NA NA NA ...
 $ cell.type : chr [1:180] "Cardiomyocytes" "Fibroblast" "Endothelial.Cells" "Macrophage" ...
 $ proportion: num [1:180] 0.9115 0.016 0.0398 0.0328 0 ...
 $ sample.id : chr [1:180] "fraction_1" "fraction_1" "fraction_1" "fraction_1" ...

# The default pipe looks like this |> 
# Think of it as saying, "Then, do this.."
data_long <- data_long |>
  mutate(sample.id = as.character(Row.names)) |> 
  select(-Row.names)
str(data_long)

tibble [180 × 6] (S3: tbl_df/tbl/data.frame)
 $ type      : chr [1:180] "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" "purified_cardiomyocytes" ...
 $ genotype  : chr [1:180] NA NA NA NA ...
 $ treatment : chr [1:180] NA NA NA NA ...
 $ cell.type : chr [1:180] "Cardiomyocytes" "Fibroblast" "Endothelial.Cells" "Macrophage" ...
 $ proportion: num [1:180] 0.9115 0.016 0.0398 0.0328 0 ...
 $ sample.id : chr [1:180] "fraction_1" "fraction_1" "fraction_1" "fraction_1" ...

In the piping example, someone reading your code can easily see what you’ve done. Take data_long and then mutate Row.names into a character column called sample.id and then select the columns that aren’t Row.names. As we make more complex manipulations later, you’ll see how much this helps keep things readable and functional.

Behind the scenes, pipes assume that you want to take the thing being piped (the object before it) and make it the first argument of the function you’re piping into. Because of that, piping data in base R only works if the target function’s first argument is the input data.

Wrangling for plotting

With our data in this format, we can make a lot of cool plots. Lets start with the bar plot we had planned.

data_long |> 
  ggplot(aes(x = sample.id, y = proportion, fill = cell.type))+
  geom_bar(position="fill", stat="identity")

It worked, but it looks… less than pleasing. Lets remind ourselves of what we wanted to see in the plot: groups side-by-side. What groups do we have? Well, there’s some controls (three types) and actual samples from the experiment. Within the actual samples, we have two genotypes and two treatments. That’s a bit too much for one plot. I’d like to start by making a plot just for the controls for now. filter from the dplyr package will help separate the groups.

data_long |> 
  filter(type != "whole_tissue") |> 
  ggplot(aes(x = sample.id, y = proportion, fill = cell.type))+
  geom_bar(position="fill", stat="identity")

Looks good, but lets pick up the pace. For this next iteration, we’ll make several changes at once, with the goal of making it easier to compare groups, more readable, and all-around nicer to look at.

data_long |> 
  filter(type != "whole_tissue") |> 
  ggplot(aes(x = sample.id, y = proportion, fill = cell.type))+
  geom_bar(position="fill", stat="identity", color = "black", width = 1) +
  facet_grid(cols=vars(type), scales = "free") +
  scale_fill_manual(values = c("#66C2A5","#FC8D62", "#8DA0CB", "#E78AC3", "#A6D854")) +
  theme_minimal() +
  theme(
    axis.title.x = element_blank(), 
    legend.title = element_blank(),
    legend.position = "bottom"
  ) +
  guides(x = guide_axis(angle = 45)) +
  labs(title = "Cell type proportions in purified control samples",
       y = "Cell Type Proportion")

For the treatment and genotype samples, it would be easier to compare shifts in individual cell types if we break up the stacked bar chart so that the cell types are spread across the x-axis. Then, we could have an individual bar for each genotype-treament at each cell type position side-by-side.

What is data wrangling?

Working with vectors

Working with data frames

dplyr verbs

Row Verbs

Column Verbs

Group Verbs

Ungrouping with ungroup

the .by argument

Conclusion

Case study

Getting familiar with the data

Planning the analysis

Manipulating data frames

Summarizing and subsetting

Combining and reordering

cbind

merge

Preparing for different visualizations

Pivoting wide and long

Pipes

Wrangling for plotting

`dplyr` verbs

Ungrouping with `ungroup`

the `.by` argument