Wrangling the behavior weight data sheet

Working with the full tidyverse and the lubridate time-package

suppressWarnings(library(tidyverse))
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
suppressWarnings(library(lubridate))
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date

Read the data into R

Read the master weight sheet into R and use the gather function to go from a wide data set to a tall data set.

### read master weight file into R
weights <- read_csv("share_weight.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `ANIMAL ID` = col_character(),
##   DOB = col_character(),
##   SEX = col_character(),
##   GENOTYPE = col_character(),
##   `April 4, 2014` = col_character(),
##   `April 18, 2014` = col_character(),
##   `January 2, 2015` = col_character(),
##   `March 27, 2015` = col_character(),
##   `July 3, 2015` = col_character(),
##   `November 6, 2015` = col_character(),
##   `December 18, 2015` = col_character(),
##   `January 6, 2016` = col_character(),
##   `February 5, 2016` = col_character(),
##   `April 1, 2016` = col_character(),
##   `August 5, 2016` = col_character(),
##   `August 19, 2016` = col_character(),
##   `September 2, 2016` = col_character(),
##   `April 7, 2017` = col_character(),
##   `July 7, 2017` = col_character()
## )
## See spec(...) for full column specifications.
head(weights)
## # A tibble: 6 x 117
##   `ANIMAL ID`       DOB    SEX GENOTYPE `December 13, 2013`
##         <chr>     <chr>  <chr>    <chr>               <dbl>
## 1      749162 7/31/2013   male       wt                32.3
## 2      943492 7/31/2013   male       wt                25.9
## 3      914784 7/31/2013   male       wt                34.5
## 4       48711 7/31/2013 female       wt                24.9
## 5      722836 7/31/2013 female       wt                21.8
## 6      869852 7/31/2013 female       wt                25.5
## # ... with 112 more variables: `December 27, 2013` <dbl>, `January 10,
## #   2014` <dbl>, `January 24, 2014` <dbl>, `February 7, 2014` <dbl>,
## #   `February 21, 2014` <dbl>, `March 7, 2014` <dbl>, `March 21,
## #   2014` <dbl>, `April 4, 2014` <chr>, `April 18, 2014` <chr>, `May 2,
## #   2014` <dbl>, `May 15, 2014` <dbl>, `May 30, 2014` <dbl>, `June 13,
## #   2014` <dbl>, `June 27, 2014` <dbl>, `July 11, 2014` <dbl>, `July 25,
## #   2014` <dbl>, `August 4, 2014` <dbl>, `August 15, 2014` <dbl>, `August
## #   29, 2014` <dbl>, `September 5, 2014` <dbl>, `September 12,
## #   2014` <dbl>, `September 26, 2014` <dbl>, `October 3, 2014` <dbl>,
## #   `October 10, 2014` <dbl>, `October 24, 2014` <dbl>, `November 7,
## #   2014` <dbl>, `November 21, 2014` <dbl>, `December 5, 2014` <dbl>,
## #   `December 19, 2014` <dbl>, `January 2, 2015` <chr>, `January 16,
## #   2015` <dbl>, `January 30, 2015` <dbl>, `February 6, 2015` <dbl>,
## #   `February 13, 2015` <dbl>, `February 27, 2015` <dbl>, `March 6,
## #   2015` <dbl>, `March 13, 2015` <dbl>, `March 27, 2015` <chr>, `April 3,
## #   2015` <dbl>, `April 10, 2015` <dbl>, `April 24, 2015` <dbl>, `May 3,
## #   2015` <dbl>, `May 8, 2015` <dbl>, `May 22, 2015` <dbl>, `June 5,
## #   2015` <dbl>, `June 19, 2015` <dbl>, `July 3, 2015` <chr>, `July 17,
## #   2015` <dbl>, `July 31, 2015` <dbl>, `August 7, 2015` <dbl>, `August
## #   14, 2015` <dbl>, `August 28, 2015` <dbl>, `September 4, 2015` <dbl>,
## #   `September 11, 2015` <dbl>, `September 25, 2015` <dbl>, `October 2,
## #   2015` <dbl>, `October 9, 2015` <dbl>, `October 23, 2015` <dbl>,
## #   `November 6, 2015` <chr>, `November 20, 2015` <dbl>, `December 4,
## #   2015` <dbl>, `December 18, 2015` <chr>, `January 6, 2016` <chr>,
## #   `January 22, 2016` <dbl>, `February 5, 2016` <chr>, `February 19,
## #   2016` <dbl>, `March 4, 2016` <dbl>, `March 18, 2016` <dbl>, `April 1,
## #   2016` <chr>, `April 14, 2016` <dbl>, `April 28, 2016` <dbl>, `May 6,
## #   2016` <dbl>, `May 12, 2016` <dbl>, `May 27, 2016` <dbl>, `June 3,
## #   2016` <dbl>, `June 10, 2016` <dbl>, `June 24, 2016` <dbl>, `July 1,
## #   2016` <dbl>, `July 8, 2016` <dbl>, `July 22, 2016` <dbl>, `August 5,
## #   2016` <chr>, `August 19, 2016` <chr>, `September 2, 2016` <chr>,
## #   `September 16, 2016` <dbl>, `September 29, 2016` <dbl>, `October 7,
## #   2016` <dbl>, `October 14, 2016` <dbl>, `October 28, 2016` <dbl>,
## #   `November 4, 2016` <dbl>, `November 11, 2016` <dbl>, `December 2,
## #   2016` <dbl>, `December 9, 2016` <dbl>, `December 23, 2016` <dbl>,
## #   `January 6, 2017` <dbl>, `January 20, 2017` <dbl>, `February 3,
## #   2017` <dbl>, `February 17, 2017` <dbl>, `March 3, 2017` <dbl>, `March
## #   17, 2017` <dbl>, `April 7, 2017` <chr>, ...

“Gather” the data

The gather() function is used to shift data that is a column header into a single row. This converts for example the column headers Dec 13, 2013, December 27th, 2013, January 12th, 2014, etc from column headers into a single column named “date.” The corresponding weights are matched to the appropriate animals and into a new column named “weight.” We can see that now the date and corresponding weights are matched into new columns named date/weight.

### Gather the data from columns into a single row for tidy data
wt_gather <- weights %>% gather(date, weight, 5:117)
head(wt_gather)
## # A tibble: 6 x 6
##   `ANIMAL ID`       DOB    SEX GENOTYPE              date weight
##         <chr>     <chr>  <chr>    <chr>             <chr>  <chr>
## 1      749162 7/31/2013   male       wt December 13, 2013   32.3
## 2      943492 7/31/2013   male       wt December 13, 2013   25.9
## 3      914784 7/31/2013   male       wt December 13, 2013   34.5
## 4       48711 7/31/2013 female       wt December 13, 2013   24.9
## 5      722836 7/31/2013 female       wt December 13, 2013   21.8
## 6      869852 7/31/2013 female       wt December 13, 2013   25.5

Convert dates to machine-readable

We can then convert the dates from the December 13th, 2013 format into a 12/13/13 (mdy) format that is machine-readable.

### Convert the dates into machine-readable month-day-year dates
wt_gather$date <- mdy(wt_gather$date)
wt_gather$DOB <- mdy(wt_gather$DOB)

Calculate the time difference between birth date and date weight taken

Using the timediff function behind the scenes we can create a new column (age_wt) that calculates the number of days between the date of the weight and the date of birth. We can exclude the dates that are negative. Additionally we convert weight to numeric, as it was originally imported as a character (eg letter).

### Calculate the difference in days between the date of birth (DOB)
### and the specific date the weight was taken
### convert weight to numeric instead of character
wt_age <- wt_gather %>% 
  mutate(age_wt = date - DOB) %>% 
  filter(age_wt > 0)
wt_age$weight <- as.numeric(wt_age$weight)
## Warning: NAs introduced by coercion
head(wt_age)
## # A tibble: 6 x 7
##   `ANIMAL ID`        DOB    SEX GENOTYPE       date weight   age_wt
##         <chr>     <date>  <chr>    <chr>     <date>  <dbl>   <time>
## 1      749162 2013-07-31   male       wt 2013-12-13   32.3 135 days
## 2      943492 2013-07-31   male       wt 2013-12-13   25.9 135 days
## 3      914784 2013-07-31   male       wt 2013-12-13   34.5 135 days
## 4       48711 2013-07-31 female       wt 2013-12-13   24.9 135 days
## 5      722836 2013-07-31 female       wt 2013-12-13   21.8 135 days
## 6      869852 2013-07-31 female       wt 2013-12-13   25.5 135 days

Time binning

To associate the days w/ number of months between weights we can pull the wt_age (time between weight date/DOB) and then “bin” it into 30 day increments. This is slightly messy for use in statistics, as 31 days = 2 months, while 30 days = 1 month, but it works perfectly for graphing. We then add a new column to our original dataframe that incorporates the matching time-bins to the animals. In this example, 135 days = 4 months + 15 days = 5 months.

### create a vector and then bin
age_vec <- wt_age$age_wt
bin_x <- seq(0, 1530, 30)
bin_vec <- .bincode(age_vec, bin_x, TRUE, TRUE)


wt_age["bin"] <- bin_vec

head(wt_age)
## # A tibble: 6 x 8
##   `ANIMAL ID`        DOB    SEX GENOTYPE       date weight   age_wt   bin
##         <chr>     <date>  <chr>    <chr>     <date>  <dbl>   <time> <int>
## 1      749162 2013-07-31   male       wt 2013-12-13   32.3 135 days     5
## 2      943492 2013-07-31   male       wt 2013-12-13   25.9 135 days     5
## 3      914784 2013-07-31   male       wt 2013-12-13   34.5 135 days     5
## 4       48711 2013-07-31 female       wt 2013-12-13   24.9 135 days     5
## 5      722836 2013-07-31 female       wt 2013-12-13   21.8 135 days     5
## 6      869852 2013-07-31 female       wt 2013-12-13   25.5 135 days     5

Summarize the date by sex, genotype, and time bin

Now that we have a nice tidy data set we can exclude animals w/ unknown genotypes, weights that don’t exist, or outliers (one animal has a bodyweight of 2708 g). Our summary stats are grouped by sex, genotype, and then time bin. The summary stats include the N and the mean weight. From the header we can see the new dataframe, which now has headers for sex, gt, time bin, n, and then weight. It is origin/systat ready, or we can keep going in R for preliminary graphs.

### start the summarize pipe, exclude unknown genotypes, NA weights or outliers (> 55)
### group by SEX, GENOTYPE, and by the age in days then calc the mean
wt_sum <- wt_age %>% 
  filter(GENOTYPE != "NA", GENOTYPE != "?", weight != "NA", weight < 55) %>% 
  group_by(SEX, GENOTYPE, bin) %>% 
  summarise(n = n(), weight = mean(weight, na.rm = T))

### view the end results
wt_sum
## # A tibble: 148 x 5
## # Groups:   SEX, GENOTYPE [?]
##       SEX GENOTYPE   bin     n   weight
##     <chr>    <chr> <int> <int>    <dbl>
##  1 female      het     1    26 14.50769
##  2 female      het     2   251 16.90837
##  3 female      het     3   304 18.96283
##  4 female      het     4   296 20.26081
##  5 female      het     5   261 21.23908
##  6 female      het     6   165 21.87152
##  7 female      het     7   140 22.56571
##  8 female      het     8   119 23.11597
##  9 female      het     9   120 23.41667
## 10 female      het    10   112 23.60357
## # ... with 138 more rows

Preliminary graph

We can graph the prelim data, and facet by sex/genotype. I applied a smoothed line of best fit, not a true correlation but a moving average equivalent.

### Faceted graph by genotype and sex
(g1 <- ggplot(data = wt_sum, aes(x = bin, y = weight, color = GENOTYPE)) + 
    geom_point(size = 6, alpha = 0.2) + 
    geom_smooth(alpha = .5, size = 1.5) + 
    facet_wrap(~SEX, nrow = 2) + 
    labs(x = "Age in months", y = "Body weight in grams") + 
    scale_x_continuous(breaks = c(0, 5, 10, 15, 20, 25, 30)))
## `geom_smooth()` using method = 'loess'

These last two steps would save the graph and the data file into excel.

save the plot

ggsave(“weight_age.png”, width = 20, height = 20, units = “cm”)

Save table to CSV file

write.csv(wt_age, “wt_age.csv”)