Today we are going to tidy and process our data into a format suitable for plotting.
I will break the tidying and processing into seven steps below, each introduced by a Goal.

Load Packages and Read Data

Similar to last week, let’s begin by loading the packages that we will need today.

library(tidyverse)
library(here)

Great. Now I will read in the data the same way we did last week.
This step only needs to be done once.

#list files
files <- list.files(here::here(),full.names = TRUE)[1:3]
files
## [1] "/Users/grad/Box/Joy_worm_images/Joy/20200519_jordan_H01.csv"
## [2] "/Users/grad/Box/Joy_worm_images/Joy/20200521_jordan_H02.csv"
## [3] "/Users/grad/Box/Joy_worm_images/Joy/20200526_jordan_H03.csv"
#read in data from files
worms <- purrr::map_dfr(files, ~readr::read_csv(.x))
First 4 rows of data
X1 Label Area Angle Length
1 p01-growth-H01-2X_B01.TIF 81 0.000 80.435
2 p01-growth-H01-2X_B01.TIF 5 0.000 3.875
3 p01-growth-H01-2X_B01.TIF 77 0.000 76.811
4 p01-growth-H01-2X_B01.TIF 6 36.027 5.101

Integrate dplyr and tidyr to manipulate and tidy data

Goal: Add a new column called Row. Select only the columns we need for future analysis.

This first step has two parts.

  1. We are going to use dplyr::mutate to add a column called Row. For now this column simply holds the row number: row 5 gets the value 5, row 50 gets 50, row 100 gets 100, and so on. We will use Row for other things later on.

  2. We will use dplyr::select to select only the following columns: Row, Label, and Length. These are the only three we need for our next steps.

Notice that we are using pipes (%>%) for the first time! Pipes make coding much easier because we can do these two steps back to back. Each pipe passes the result of the previous step to the next function as its input, so you only need to name the dataframe worms once.

step1 <- worms %>%
  dplyr::mutate(Row = row_number()) %>%
  dplyr::select(Row, Label, Length)
First 4 rows of data
Row Label Length
1 p01-growth-H01-2X_B01.TIF 80.435
2 p01-growth-H01-2X_B01.TIF 3.875
3 p01-growth-H01-2X_B01.TIF 76.811
4 p01-growth-H01-2X_B01.TIF 5.101
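
To see what the pipe is doing, here is a small sketch on a made-up three-row data frame (toy is a stand-in I invented, not our real data). The piped version and the nested version should give the same result; the pipe just hands the left-hand result to the next function as its first argument.

```r
library(dplyr)

# Made-up stand-in for worms (values are illustrative only)
toy <- tibble::tibble(
  Label  = c("a.TIF", "a.TIF", "b.TIF"),
  Area   = c(81, 5, 77),
  Length = c(80.435, 3.875, 76.811)
)

# With pipes: read the steps top to bottom
piped <- toy %>%
  dplyr::mutate(Row = row_number()) %>%
  dplyr::select(Row, Label, Length)

# Without pipes: the same two steps, nested inside out
nested <- dplyr::select(dplyr::mutate(toy, Row = row_number()), Row, Label, Length)

# Check that both versions produce the same data frame
identical(piped, nested)
```

The nested version has to be read from the inside out, which gets hard to follow as more steps are added.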

Goal: Separate the column Label into multiple columns of information

We talked about how the column Label holds a lot of information that we would like to separate into multiple columns. We will do this with two main steps.

  1. The bulk of our separating happens here. We use tidyr::separate to split Label into 5 columns (Plate, Experiment, Hour, Magnification, and Well) at every punctuation character.
    • If we stop here, the Hour column will hold values like H01, H02, … H72. This is not quite what we want: the letter “H” should not be part of this column. The next step takes care of that.
  2. We will change the column Hour using dplyr::mutate. We are also using a new package here called stringr. I don’t want to spend time on this additional package, but basically we use stringr::str_extract to pull out the part of a string that matches a pattern.
    • In this case the pattern we are looking for is 2 digits next to each other. In code we write this as pattern = "[:digit:]{2}".
step2 <- step1 %>%
  tidyr::separate(Label, into=c("Plate", "Experiment", "Hour", "Magnification", "Well"), sep="[[:punct:]]") %>%
  dplyr::mutate(Hour = stringr::str_extract(Hour, pattern = "[:digit:]{2}"))
## Warning: Expected 5 pieces. Additional pieces discarded in 150 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
First 4 rows of data
Row Plate Experiment Hour Magnification Well Length
1 p01 growth 01 2X B01 80.435
2 p01 growth 01 2X B01 3.875
3 p01 growth 01 2X B01 76.811
4 p01 growth 01 2X B01 5.101

Also notice that we get a warning out of this run. R is alerting us that we asked for 5 pieces, but splitting Label at every punctuation character actually produces 6: the sixth is the file extension TIF. R is telling us that it discarded that extra piece. In our case this is perfectly okay. However, you can imagine that in some other case this warning would be helpful in alerting you that you forgot to include enough columns to contain all of your information.
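
If the warning bothers you, here are two ways to deal with the sixth piece, sketched on a single made-up row in the same label format as ours. The extra = "drop" argument of tidyr::separate discards surplus pieces without warning. The column name Extension in the second option is my own invention for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# One label in the same format as our data
toy <- tibble::tibble(Label = "p01-growth-H01-2X_B01.TIF")

cols <- c("Plate", "Experiment", "Hour", "Magnification", "Well")

# Option 1: keep five columns and quietly drop the sixth piece ("TIF")
dropped <- toy %>%
  tidyr::separate(Label, into = cols, sep = "[[:punct:]]", extra = "drop") %>%
  dplyr::mutate(Hour = stringr::str_extract(Hour, pattern = "[:digit:]{2}"))

# Option 2: give the sixth piece a column so nothing is discarded
# ("Extension" is a hypothetical column name)
kept <- toy %>%
  tidyr::separate(Label, into = c(cols, "Extension"), sep = "[[:punct:]]")

dropped$Hour    # the two digits, stored as text
kept$Extension  # the piece that was being discarded
```

Note that Hour is still text after str_extract. If you need it as a number later, as.numeric("01") gives 1.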

Goal: Group the data by Animal (i.e. Length and Width measured for a single animal)

When you guys were collecting measurements, you were careful to always measure the length of an animal first and the width of the same animal second. So in our dataframe every two rows correspond to a single animal. We now need to make that designation. This will be done in a single step.

We will use dplyr::group_by to both group the data and create a new column to show these groupings. Inside group_by, rep(..., each = 2) repeats every row number twice (1, 1, 2, 2, 3, 3, …) and length.out = n() trims the result to the number of rows in the data, so each pair of rows shares one Animal number.

step3 <- step2 %>%
  dplyr::group_by(Animal = rep(row_number(), length.out = n(), each = 2)) 
First 4 rows of data
Row Plate Experiment Hour Magnification Well Length Animal
1 p01 growth 01 2X B01 80.435 1
2 p01 growth 01 2X B01 3.875 1
3 p01 growth 01 2X B01 76.811 2
4 p01 growth 01 2X B01 5.101 2

Now we should have a dataframe with a new column called Animal that gives the same number to both measurements (Length and Width) of a single animal.
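
The numbering trick is easier to see outside of a dataframe. This is a minimal sketch in base R, pretending we have only 6 rows (n_rows is a made-up stand-in for n()):

```r
# Pretend we have 6 rows, i.e. 3 animals
n_rows <- 6

# each = 2 repeats every value twice: 1 1 2 2 3 3 4 4 ...
# length.out then trims the result back down to the number of rows
animal <- rep(seq_len(n_rows), each = 2, length.out = n_rows)

animal  # 1 1 2 2 3 3
```

Without length.out the vector would be twice as long as the data, and R would refuse to use it as a column.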

Goal: Tell R which measurement is a Length and which is a Width

So now that we have grouped the data by Animal we want to actually designate which value is Length and which is Width. To do this we are going to use dplyr::mutate again. But now we are going to change the column Row that we created in step 1.

We need a simple way to tell R which measurements should be Length and which should be Width. What I came up with is telling R that we have two possible scenarios (or cases). A case where:

  1. The value in the Length column is less than 60 – and therefore is a Width measurement
    or ….

  2. The value in Length is greater than or equal to 60 – and therefore is a Length measurement.

To apply these two conditions in code we use the function dplyr::case_when

step4 <- step3 %>%
  dplyr::mutate(Row = dplyr::case_when(Length < 60 ~ "Width",
                                       Length >= 60 ~ "Length"))
First 4 rows of data
Row Plate Experiment Hour Magnification Well Length Animal
Length p01 growth 01 2X B01 80.435 1
Width p01 growth 01 2X B01 3.875 1
Length p01 growth 01 2X B01 76.811 2
Width p01 growth 01 2X B01 5.101 2

And just like that we now know which rows correspond to Length measurements and which correspond to Width. I tested this out with everyone’s data and it should work uniformly for you all.
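
The size cutoff works because lengths and widths in our data are far apart. As a hedged alternative (this is not what we do above, just another option), the measurement order could be used instead, since the length was always measured first. Here is a sketch on made-up values for two animals; the column names BySize and ByOrder are invented for the comparison.

```r
library(dplyr)

# Made-up measurements for two animals: length first, width second
toy <- tibble::tibble(
  Animal = c(1, 1, 2, 2),
  Length = c(80.435, 3.875, 76.811, 5.101)
)

labelled <- toy %>%
  dplyr::group_by(Animal) %>%
  dplyr::mutate(
    # Rule used in this tutorial: decide by size
    BySize = dplyr::case_when(Length < 60 ~ "Width",
                              Length >= 60 ~ "Length"),
    # Alternative: decide by position within each animal
    ByOrder = dplyr::if_else(row_number() == 1, "Length", "Width")
  ) %>%
  dplyr::ungroup()

# Do the two rules agree on this toy data?
all(labelled$BySize == labelled$ByOrder)
```

One thing to keep in mind with case_when: a row that matches none of the conditions (for example a missing measurement) gets NA.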

Goal: Spread out the Length and Width data into separate columns

Now we are getting towards the end of the “tidy & manipulate” section. The final thing we want to do here is spread out the data to make it wider. We would like to have two separate columns for Length and Width. To do so we will use the function tidyr::pivot_wider.

  • The new column names we want are in the current column Row while the values for each are in the current column Length.
step5 <- step4 %>%
  tidyr::pivot_wider(names_from = Row, values_from = Length)
First 4 rows of data
Plate Experiment Hour Magnification Well Animal Length Width
p01 growth 01 2X B01 1 80.435 3.875
p01 growth 01 2X B01 2 76.811 5.101
p01 growth 01 2X B01 3 84.011 5.080
p01 growth 01 2X B01 4 70.411 4.178
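
Here is the same reshaping on a tiny made-up table, so you can see exactly what moves where. This is only a sketch; long and wide are names I picked for the example.

```r
library(dplyr)
library(tidyr)

# Long format: one row per measurement (made-up values)
long <- tibble::tibble(
  Animal = c(1, 1, 2, 2),
  Row    = c("Length", "Width", "Length", "Width"),
  Length = c(80.435, 3.875, 76.811, 5.101)
)

# Wide format: one row per animal, one column per measurement type
wide <- long %>%
  tidyr::pivot_wider(names_from = Row, values_from = Length)

wide
```

The Animal column is what keeps each pair together. Without a column that uniquely identifies each animal, pivot_wider cannot tell which Length belongs with which Width.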

Process Data

The last thing we are going to do is a bit of data processing. We talked about wanting to plot not only what happens to Length and Width over time but also Volume. As such, we will first need to create a Volume column.

Goal: Calculate volume of an animal and store this value in a new column

Again we are using dplyr::mutate to add a new column. We will actually create 2 new columns. The first will be Radius. This time, instead of filling a new column with a fixed value, we will fill it with the result of an equation.

We know that the Radius of an object is simply its Width/2. So all we need to tell R is to do this calculation. We will do the same to create the Volume column, except that the equation is slightly longer: pi*Radius^2*Length, the volume of a cylinder.
Note: R already knows that pi stands for the mathematical constant 3.14…

step6 <- step5 %>%
    dplyr::mutate(Radius = Width/2, 
                  Volume = pi*Radius^2*Length)
First 4 rows of data
Plate Experiment Hour Magnification Well Animal Length Width Radius Volume
p01 growth 01 2X B01 1 80.435 3.875 1.9375 948.5896
p01 growth 01 2X B01 2 76.811 5.101 2.5505 1569.7263
p01 growth 01 2X B01 3 84.011 5.080 2.5400 1702.7601
p01 growth 01 2X B01 4 70.411 4.178 2.0890 965.3110
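
It is always worth checking one row by hand. This quick sketch redoes the calculation for the first animal in the table, using plain numbers instead of columns (the variable names are made up for the example):

```r
# First animal from the table (values in pixels)
length_px <- 80.435
width_px  <- 3.875

radius_px <- width_px / 2                  # 1.9375
volume_px <- pi * radius_px^2 * length_px  # volume of a cylinder

round(volume_px, 4)  # 948.5896
```

The result matches the Volume value in the first row, so the mutate step is doing what we expect.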

Goal: Convert pixels to microns

Great! So we could basically be done now. But I want to do one last thing. The measurements you took from my images were all in pixels. We will now convert pixels to microns. This again is done using dplyr::mutate (a function with extreme versatility and application if you haven’t realized already).

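Essentially we are replacing the values already held in these columns. Length, Width, and Radius are lengths, so each is multiplied by the conversion factor once. Volume is a length cubed, so it has to be multiplied by the conversion factor cubed.

# 3.2937 is the pixel-to-micron conversion factor
# Volume is cubic, so it uses the factor cubed
tidydata <- step6 %>%
    dplyr::mutate(Length = 3.2937*Length,
                  Width = 3.2937*Width,
                  Radius = 3.2937*Radius,
                  Volume = 3.2937^3*Volume)
First 4 rows of data
Plate Experiment Hour Magnification Well Animal Length Width Radius Volume
p01 growth 01 2X B01 1 264.9288 12.76309 6.381544 33894.60
p01 growth 01 2X B01 2 252.9924 16.80116 8.400582 56088.79
p01 growth 01 2X B01 3 276.7070 16.73200 8.365998 60842.29
p01 growth 01 2X B01 4 231.9127 13.76108 6.880539 34492.08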
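
A note on units, as a sketch for the first animal. I am assuming here that 3.2937 means microns per pixel. A volume is built from three lengths multiplied together, so converting it to cubic microns takes the factor cubed. One way to convince yourself is to compute the micron volume two ways and compare (all variable names are made up for the example):

```r
px_to_um <- 3.2937  # conversion factor, assumed to be microns per pixel

# First animal, in pixels
length_px <- 80.435
radius_px <- 3.875 / 2
volume_px <- pi * radius_px^2 * length_px

# Route 1: convert the lengths first, then compute the volume
volume_um_a <- pi * (px_to_um * radius_px)^2 * (px_to_um * length_px)

# Route 2: scale the pixel volume by the factor cubed
volume_um_b <- px_to_um^3 * volume_px

# The two routes should agree
all.equal(volume_um_a, volume_um_b)
```

Multiplying the pixel volume by the factor only once would give a number about ten times too small (3.2937 squared is roughly 10.8).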

Great job! Now we have a tidy dataframe that is ready for us to plot. We will learn plotting basics next week.

Try these steps out on your own this week. Now that you have been introduced to piping (%>%) can you figure out how to string all these steps into a single block of code?

i.e.:

tidydata <- worms %>%
            (step 1 code) %>%
            (step 2 code) %>%
            (step 3 code) %>%
            etc...