Tidy & Process Data

Today we are going to tidy and process our data into a suitable format for future plotting.
I will break down the work of manipulating and tidying the data into the 7 main steps below.

Load Packages and Read Data

Similar to last week, let’s begin by loading the packages that we will need today.

library(tidyverse)
library(here)

Great. Now I will read in the data the same way we did last week.
This step only needs to be done once.

# list files
files <- list.files(here::here(), full.names = TRUE)[1:3]
files

## [1] "/Users/grad/Box/Joy_worm_images/Joy/20200519_jordan_H01.csv"
## [2] "/Users/grad/Box/Joy_worm_images/Joy/20200521_jordan_H02.csv"
## [3] "/Users/grad/Box/Joy_worm_images/Joy/20200526_jordan_H03.csv"

#read in data from files
worms <- purrr::map_dfr(files, ~readr::read_csv(.x))

First 4 rows of data
X1	Label	Area	Angle	Length
1	p01-growth-H01-2X_B01.TIF	81	0.000	80.435
2	p01-growth-H01-2X_B01.TIF	5	0.000	3.875
3	p01-growth-H01-2X_B01.TIF	77	0.000	76.811
4	p01-growth-H01-2X_B01.TIF	6	36.027	5.101
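Indexing with [1:3] keeps whichever three files happen to be listed first in the folder. As a rough sketch of one alternative, list.files can filter by file name instead. The temporary folder and the file names below are made up for illustration and are not part of our project.

```r
library(purrr)
library(readr)

# Hypothetical setup: a temporary folder holding two CSV files and one other file
tmp <- file.path(tempdir(), "worm_demo")
dir.create(tmp, showWarnings = FALSE)
readr::write_csv(data.frame(Label = "a.TIF", Length = 80.435), file.path(tmp, "demo_H01.csv"))
readr::write_csv(data.frame(Label = "b.TIF", Length = 3.875), file.path(tmp, "demo_H02.csv"))
writeLines("not data", file.path(tmp, "notes.txt"))

# Keep only the files whose names end in .csv
demo_files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read each file and stack the rows into one dataframe
demo_worms <- purrr::map_dfr(demo_files, ~readr::read_csv(.x))
```

Filtering by pattern means that an extra file in the folder (like notes.txt here) is never read by accident.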

Integrate dplyr and tidyr to manipulate and tidy data

Goal: Add a new column called Row. Select only the columns we need for future analysis.

This first step has two parts.

We are going to use dplyr::mutate to add a column called Row. At this point we are just going to fill this new column with each row's own number. So row 5 will have the value 5, row 50 will be 50, row 100 will be 100 and so on. We will be using this Row column for other things later on.
We will use dplyr::select to select only the following columns: Row, Label, and Length. These are the only three we need for our next steps.

Notice, we are using pipes (%>%) for the first time! They make coding much easier since we can now do these two steps back to back. The pipe hands the result of each step to the next function as its input, so you only need to name the dataframe worms once.

step1 <- worms %>%
  dplyr::mutate(Row = row_number()) %>%
  dplyr::select(Row, Label, Length)

First 4 rows of data
Row	Label	Length
1	p01-growth-H01-2X_B01.TIF	80.435
2	p01-growth-H01-2X_B01.TIF	3.875
3	p01-growth-H01-2X_B01.TIF	76.811
4	p01-growth-H01-2X_B01.TIF	5.101
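To see what the pipe is doing, here is a small sketch that writes step 1 both with and without pipes. The dataframe toy_worms is a made-up stand-in with the same column names as worms, not our real data.

```r
library(dplyr)

# Hypothetical stand-in for worms with the same column names
toy_worms <- tibble(
  X1 = 1:4,
  Label = "p01-growth-H01-2X_B01.TIF",
  Area = c(81, 5, 77, 6),
  Angle = c(0, 0, 0, 36.027),
  Length = c(80.435, 3.875, 76.811, 5.101)
)

# With pipes: read the steps from top to bottom
piped <- toy_worms %>%
  dplyr::mutate(Row = row_number()) %>%
  dplyr::select(Row, Label, Length)

# Without pipes: the same two steps, nested from the inside out
nested <- dplyr::select(dplyr::mutate(toy_worms, Row = row_number()), Row, Label, Length)

# Compare the two results
all.equal(piped, nested)
```

Both versions do the same work. The piped one is just easier to read and to extend with more steps.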

Goal: Separate the column Label into multiple columns of information

We talked about how the column Label holds a lot of information that we would like to separate into multiple columns. We will do this with two main steps.

The bulk of our separating happens here. We use tidyr::separate to separate Label into 5 columns (Plate, Experiment, Hour, Magnification, and Well) by any punctuation.
- If we just do this we will notice that the Hour column has values that look like this -> H01, H02… H72. This is not quite what we would like. We do not want the letter “H” to be a part of this column. In order to take care of this we must do the following step.
We will change the column Hour using dplyr::mutate. We are also using a new package here called stringr. I don’t want to spend time talking about this additional package, but basically we use stringr::str_extract to pull out the part of a string that matches a pattern.
- In this case the pattern we are looking for is 2 digits next to each other. To write this in code we say pattern = "[:digit:]{2}".

step2 <- step1 %>%
  tidyr::separate(Label, into=c("Plate", "Experiment", "Hour", "Magnification", "Well"), sep="[[:punct:]]") %>%
  dplyr::mutate(Hour = stringr::str_extract(Hour, pattern = "[:digit:]{2}"))

## Warning: Expected 5 pieces. Additional pieces discarded in 150 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

First 4 rows of data
Row	Plate	Experiment	Hour	Magnification	Well	Length
1	p01	growth	01	2X	B01	80.435
2	p01	growth	01	2X	B01	3.875
3	p01	growth	01	2X	B01	76.811
4	p01	growth	01	2X	B01	5.101

Also notice that we get a warning out of this run. Splitting Label on punctuation produces 6 pieces (the sixth is the file extension, TIF), but we only named 5 new columns, so R is telling us that it discarded the extra piece. In our case this is perfectly okay. However, you can imagine that in some other case this warning would be helpful in alerting you that you forgot to include enough columns to contain all of your information.
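If you have decided the extra piece really is safe to throw away, one option is the extra argument of tidyr::separate. Here is a minimal sketch on two made-up labels (toy_labels is hypothetical). It also shows what the {2} in the pattern means for a three-digit hour, which is an assumption for illustration since our hours stop at H72.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical labels in the same format as our Label column
toy_labels <- tibble(Label = c("p01-growth-H01-2X_B01.TIF",
                               "p01-growth-H02-2X_B01.TIF"))

toy_sep <- toy_labels %>%
  tidyr::separate(Label,
                  into = c("Plate", "Experiment", "Hour", "Magnification", "Well"),
                  sep = "[[:punct:]]",
                  extra = "drop") %>%   # discard the sixth piece (TIF) without a warning
  dplyr::mutate(Hour = stringr::str_extract(Hour, pattern = "[:digit:]{2}"))

# {2} keeps exactly two digits, so a hypothetical three-digit hour would be cut short
stringr::str_extract("H100", pattern = "[:digit:]{2}")   # "10"
stringr::str_extract("H100", pattern = "[:digit:]+")     # "100"
```

Only use extra = "drop" once you have checked what is being dropped. Otherwise the warning is doing you a favor.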

Goal: Group the data by Animal (i.e., Length and Width measured for a single animal)

When you guys were collecting measurements, you were careful to always measure the length of an animal first and the width of the same animal second. So in our dataframe every two rows correspond to a single animal. We now need to make that designation. This will be done in a single step.

We will use dplyr::group_by to both group the data and create a new column to show these groupings. (I actually just learned how to do this a few weeks ago.) Inside group_by, rep(row_number(), each = 2) repeats every row number twice (1, 1, 2, 2, 3, 3, ...) and length.out = n() cuts that sequence off at the number of rows in the dataframe. So rows 1 and 2 become Animal 1, rows 3 and 4 become Animal 2, and so on.

step3 <- step2 %>%
  dplyr::group_by(Animal = rep(row_number(), length.out = n(), each = 2))

First 4 rows of data
Row	Plate	Experiment	Hour	Magnification	Well	Length	Animal
1	p01	growth	01	2X	B01	80.435	1
2	p01	growth	01	2X	B01	3.875	1
3	p01	growth	01	2X	B01	76.811	2
4	p01	growth	01	2X	B01	5.101	2

Now we should have a dataframe with a new column called Animal that assigns each pair of Length and Width measurements to a single animal.
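The numbering trick inside group_by can be tried on its own in plain R. This is just a sketch with a made-up row count (n_rows is hypothetical), where seq_len stands in for what row_number() returns.

```r
n_rows <- 6                       # hypothetical dataframe with 6 rows
row_numbers <- seq_len(n_rows)    # stands in for row_number(): 1 2 3 4 5 6

# each = 2 repeats every value twice; length.out trims the result to n_rows
animal <- rep(row_numbers, each = 2, length.out = n_rows)
animal   # 1 1 2 2 3 3
```

With an odd number of rows the last animal would end up with only one measurement, which would be a hint that a measurement is missing somewhere.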

Goal: Tell R which measurement is a Length and which is a Width

So now that we have grouped the data by Animal we want to actually designate which value is Length and which is Width. To do this we are going to use dplyr::mutate again. But now we are going to change the column Row that we created in step 1.

We need a simple way to tell R which measurements should be Length and which should be Width. What I came up with is telling R that we have two possible scenarios (or cases). A case where:

The value in the Length column is less than 60 – and therefore is a Width measurement
or
The value in Length is greater than or equal to 60 – and therefore is a Length measurement.

To apply these two conditions in code we use the function dplyr::case_when.

step4 <- step3 %>%
  dplyr::mutate(Row = dplyr::case_when(Length < 60 ~ "Width",
                                       Length >= 60 ~ "Length"))

First 4 rows of data
Row	Plate	Experiment	Hour	Magnification	Well	Length	Animal
Length	p01	growth	01	2X	B01	80.435	1
Width	p01	growth	01	2X	B01	3.875	1
Length	p01	growth	01	2X	B01	76.811	2
Width	p01	growth	01	2X	B01	5.101	2

And just like that we now know which rows correspond to Length measurements and which correspond to Width. I tested this out with everyone's data and it should work the same way for all of you.
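Because the cutoff of 60 is a rule of thumb, it can be worth checking that every animal ended up with exactly one Length and one Width. Here is a minimal sketch of such a check. The data in toy is made up, and animal 2 is deliberately given a length below 60 so that the check has something to catch.

```r
library(dplyr)

# Hypothetical measurements: animal 2's first value (55.2) falls below the cutoff
toy <- tibble(
  Animal = c(1, 1, 2, 2),
  Length = c(80.435, 3.875, 55.2, 5.101)
)

labelled <- toy %>%
  dplyr::mutate(Row = dplyr::case_when(Length < 60 ~ "Width",
                                       Length >= 60 ~ "Length"))

# Count the labels per animal and keep only the animals that look wrong
problems <- labelled %>%
  dplyr::group_by(Animal) %>%
  dplyr::summarise(n_length = sum(Row == "Length"),
                   n_width = sum(Row == "Width")) %>%
  dplyr::filter(n_length != 1 | n_width != 1)

problems
```

If problems has zero rows on your own data, every animal has one measurement of each kind.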

Goal: Spread out the Length and Width data into separate columns

Now we are getting towards the end of the “tidy & manipulate” section. The final thing we want to do here is spread out the data to make it wider. We would like to have two separate columns for Length and Width. To do so we will use the function tidyr::pivot_wider.

The new column names we want are in the current column Row while the values for each are in the current column Length.

step5 <- step4 %>%
  tidyr::pivot_wider(names_from = Row, values_from = Length)

First 4 rows of data
Plate	Experiment	Hour	Magnification	Well	Animal	Length	Width
p01	growth	01	2X	B01	1	80.435	3.875
p01	growth	01	2X	B01	2	76.811	5.101
p01	growth	01	2X	B01	3	84.011	5.080
p01	growth	01	2X	B01	4	70.411	4.178
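Here is the same reshaping on a tiny made-up dataframe (long is hypothetical and only has the columns needed to show the idea), in case it helps to see pivot_wider in isolation.

```r
library(tidyr)

# Hypothetical long-format data: two rows per animal
long <- data.frame(
  Animal = c(1, 1, 2, 2),
  Row    = c("Length", "Width", "Length", "Width"),
  Length = c(80.435, 3.875, 76.811, 5.101)
)

# The labels in Row become column names; the numbers in Length fill them in
wide <- tidyr::pivot_wider(long, names_from = Row, values_from = Length)
wide
```

Every column that is not named in names_from or values_from (here just Animal) is used to decide which values belong on the same row.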

Process Data

The last thing we are going to do is a bit of data processing. We talked about wanting to plot not only what happens to Length and Width over time but also Volume. As such, we will first need to create a Volume column.

Goal: Calculate volume of an animal and store this value in a new column

Again we are using dplyr::mutate to add a new column. We will actually create 2 new columns. The first will be Radius. This time instead of assigning a new column to a single value we will be assigning it to an equation.

We know that the Radius of an object is simply its Width/2. So all we need to tell R is to do this calculation. Similarly we will do this to assign the Volume column (except in this case the equation is slightly longer). The equation we use, pi*Radius^2*Length, is the volume of a cylinder, so we are treating each animal as a cylinder.
Note: pi is built into R, so it already knows pi stands for the long mathematical constant 3.14…

step6 <- step5 %>%
  dplyr::mutate(Radius = Width/2,
                Volume = pi*Radius^2*Length)

First 4 rows of data
Plate	Experiment	Hour	Magnification	Well	Animal	Length	Width	Radius	Volume
p01	growth	01	2X	B01	1	80.435	3.875	1.9375	948.5896
p01	growth	01	2X	B01	2	76.811	5.101	2.5505	1569.7263
p01	growth	01	2X	B01	3	84.011	5.080	2.5400	1702.7601
p01	growth	01	2X	B01	4	70.411	4.178	2.0890	965.3110
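As a quick check of the arithmetic, the volume of the first animal can be worked out by hand. The helper cylinder_volume below is hypothetical and written only for this sketch. It applies the same equation as the mutate call above.

```r
# Hypothetical helper: volume of a cylinder from its length and its width (diameter)
cylinder_volume <- function(length, width) {
  radius <- width / 2
  pi * radius^2 * length
}

# First animal in the table above: Length = 80.435, Width = 3.875 (both in pixels)
cylinder_volume(80.435, 3.875)   # about 948.59 cubic pixels
```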

Goal: Convert pixels to microns

Great! So we could basically be done now. But I want to do one last thing. The measurements you took from my images were all in pixels. We will now convert pixels to microns (1 pixel = 3.2937 microns). This again is done using dplyr::mutate (a function with extreme versatility and application if you haven’t realized already).

Essentially we are replacing the values already held in these columns. Length, Width, and Radius are one-dimensional, so each is multiplied by the conversion factor once. Volume is in cubic pixels, so it has to be multiplied by the factor cubed to end up in cubic microns.

tidydata <- step6 %>%
    dplyr::mutate(Length = 3.2937*Length,
                  Width = 3.2937*Width,
                  Radius = 3.2937*Radius,
                  Volume = 3.2937^3*Volume)

First 4 rows of data
Plate	Experiment	Hour	Magnification	Well	Animal	Length	Width	Radius	Volume
p01	growth	01	2X	B01	1	264.9288	12.76309	6.381544	33894.60
p01	growth	01	2X	B01	2	252.9924	16.80116	8.400582	56088.79
p01	growth	01	2X	B01	3	276.7070	16.73200	8.365998	60842.29
p01	growth	01	2X	B01	4	231.9127	13.76108	6.880539	34492.08
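Length, Width, and Radius all get multiplied by the same number, so that part can also be written once with dplyr::across. This is only a sketch of an alternative spelling. The name um_per_pixel and the dataframe toy_px are made up for illustration.

```r
library(dplyr)

um_per_pixel <- 3.2937   # the conversion factor used above, given a hypothetical name

# Hypothetical measurements in pixels
toy_px <- tibble(Length = c(80.435, 76.811),
                 Width = c(3.875, 5.101),
                 Radius = Width / 2)

# Apply the same conversion to each of the three one-dimensional columns
toy_um <- toy_px %>%
  dplyr::mutate(dplyr::across(c(Length, Width, Radius), ~ .x * um_per_pixel))

toy_um
```

Storing the factor in a named variable also means there is only one place to change it if the microscope settings ever change.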

Great job! Now we have a tidy dataframe that is ready for us to plot. We will learn plotting basics next week.

Try these steps out on your own this week. Now that you have been introduced to piping (%>%) can you figure out how to string all these steps into a single block of code?

i.e.:

tidydata <- worms %>%
            (step 1 code) %>%
            (step 2 code) %>%
            (step 3 code) %>%
            etc...