Transforming your data into a binomial data

In my work on agricultural research, I always face times where I need to do a logistic regression analysis to understand my data and answer the questions that I need.

Some examples of those analyses can be a mortality dose-response analysis where my data refers to a dataset where there is usually a factor of two levels: “Dead” or “Alive” (sometimes another factor can be present such as “Damaged”). Another case is when you are counting the presence of seeds, nuts or even plants as your response to treatment.

When doing those kinds of experiments sometimes, you receive a spreadsheet containing just numbers such as “47 plants of 100 died on the plot 103” for instance as observed above. With this type of data, you cannot run a binomial logistic regression which sometimes may be your goal.

## 
## ---------------------------------
##  Treatment   plot   died   total 
## ----------- ------ ------ -------
##      1       101     0      100  
## 
##      2       102     25     100  
## 
##      3       103     47     100  
## 
##      4       104     99     100  
## 
##      3       201     67     100  
## 
##      4       202    100     100  
## 
##      2       203     34     100  
## 
##      1       204     0      100  
## 
##      2       301     29     100  
## 
##      3       302     62     100  
## 
##      1       303     0      100  
## 
##      4       304     95     100  
## ---------------------------------

Transforming this data by hand can be a challenge depending on the size of your dataset and the number of observations that you have. With that in mind, I will demonstrate how to transform this data into binomial data.

For doing this, I will use part of the dataset from one of my fellow graduate students Larissa Larocca. This dataset refers to the effects of some herbicides for Sucker control in hazelnuts on nut fall timing.

Lets first give a look at the data structure using glimpse() and head() and open the necessary packages

library(readxl) # this package read excel files
library(tidyverse) # this package open all the tidyverse packages
library(lubridate) # date transformation package
data <- read_excel("Amity-Nuts-2018.xlsx")

head(data) # open a data first 6 lines

## # A tibble: 6 x 6
##    plot date       total.A total.B     A     B
##   <dbl> <chr>        <dbl>   <dbl> <dbl> <dbl>
## 1   101 2018.08.31      51      52    51    52
## 2   102 2018.08.31      54      51    54    51
## 3   103 2018.08.31      51      52    51    52
## 4   104 2018.08.31      52      53    52    53
## 5   105 2018.08.31      51      50    51    50
## 6   106 2018.08.31      52      52    52    52

On this data you have the variables:

Plot: refers to a specific replication of treatment (each plot contain 2 trees);
Date: refers to the date where the nuts were counted;
Total.A: refers to the initial amount of nuts in the Tree A in the plot;
Total.B: refers to the initial amount of nuts in the Tree B in the plot;
A: refers to the nuts counted in tree A;
B: refers to the nuts counted in tree B;

First thing this data needs to be set into a tidy version before the transformation:

x <- data %>%
  gather(key = tree, value = nuts, A:B) # create collumn tree

y <- data %>%
  gather(key = tree, value = total, total.A:total.B) %>%
  select(-c(A,B)) %>%
  separate(tree, c("discart", "tree")) %>%
  select(-discart) %>%
  full_join(x) %>%
  select(-c(total.A,total.B)) # create a column for initial total number and join with x
y$tree <- factor(y$tree) #make tree a factor
data.tidy <- y
summary(data.tidy) # check data summary

##       plot         date           tree        total            nuts      
##  Min.   :101   Length:936         A:468   Min.   :49.00   Min.   : 0.00  
##  1st Qu.:178   Class :character   B:468   1st Qu.:50.00   1st Qu.: 0.00  
##  Median :255   Mode  :character           Median :51.00   Median : 6.00  
##  Mean   :255                              Mean   :51.44   Mean   :17.49  
##  3rd Qu.:332                              3rd Qu.:52.00   3rd Qu.:37.25  
##  Max.   :409                              Max.   :55.00   Max.   :55.00

head(data.tidy)

## # A tibble: 6 x 5
##    plot date       tree  total  nuts
##   <dbl> <chr>      <fct> <dbl> <dbl>
## 1   101 2018.08.31 A        51    51
## 2   102 2018.08.31 A        54    54
## 3   103 2018.08.31 A        51    51
## 4   104 2018.08.31 A        52    52
## 5   105 2018.08.31 A        51    51
## 6   106 2018.08.31 A        52    52

On the code above we create a column for “Tree” and join together the A and B values into nuts and the totals into one column. This results in a dataset containing 936 observations.

After tidying the data, we need to separate this in two datasets for each tree by using the filter() function. After this, we will need to create an observation for each nut of each plot by using the rep() function which takes as the first argument what you want to repeat and as the second argument how many times you want to repeat it.

tree.A <- data.tidy %>%
  filter(tree == "A") #filter the data for tree A

tree.A <- tree.A[rep(row.names(tree.A), tree.A$total), 1:5] # create a row for each total nut on each plot
cols <- c("plot", "tree") #select the columns to transform to factor
tree.A[cols] <- lapply(tree.A[cols], factor) # use lapply to transform columns to factor
summary(tree.A) # check the summary

##       plot           date           tree          total         nuts      
##  102    :  702   Length:23855       A:23855   Min.   :49   Min.   : 0.00  
##  109    :  689   Class :character             1st Qu.:50   1st Qu.: 0.00  
##  408    :  689   Mode  :character             Median :51   Median : 6.00  
##  104    :  676                                Mean   :51   Mean   :17.11  
##  106    :  676                                3rd Qu.:52   3rd Qu.:39.00  
##  201    :  676                                Max.   :54   Max.   :54.00  
##  (Other):19747

tree.B <- data.tidy %>%
  filter(tree == "B") # filter the data for tree B

tree.B <- tree.B[rep(row.names(tree.B), tree.B$total), 1:5] # create a row for each total nut on each plot
tree.B[cols] <- lapply(tree.B[cols], factor) # use lapply to transform columns to factor
summary(tree.B) # check the summary

##       plot           date           tree          total      
##  107    :  715   Length:24297       B:24297   Min.   :50.00  
##  309    :  715   Class :character             1st Qu.:51.00  
##  403    :  702   Mode  :character             Median :52.00  
##  104    :  689                                Mean   :51.95  
##  109    :  689                                3rd Qu.:53.00  
##  201    :  689                                Max.   :55.00  
##  (Other):20098                                               
##       nuts      
##  Min.   : 0.00  
##  1st Qu.: 0.00  
##  Median : 8.00  
##  Mean   :17.86  
##  3rd Qu.:37.00  
##  Max.   :55.00  
##

After creating a line for each nut on each tree dataset, we need to bind those two datasets together again. For doing this, we can use the function full_join() from the Dplyr package:

#join both tree data

all <- full_join(tree.A,tree.B) # join the two dataset together
all$tree <- factor(all$tree) # transform the column tree back to factor
all$date <- ymd(all$date) %>% force_tz(tzone = "PST") # make the date column into date format at the PST time zone
summary(all) # check the summary

##       plot            date            tree          total      
##  309    : 1391   Min.   :2018-08-31   A:23855   Min.   :49.00  
##  109    : 1378   1st Qu.:2018-09-28   B:24297   1st Qu.:50.00  
##  403    : 1378   Median :2018-10-09             Median :51.00  
##  102    : 1365   Mean   :2018-10-05             Mean   :51.48  
##  104    : 1365   3rd Qu.:2018-10-19             3rd Qu.:52.00  
##  107    : 1365   Max.   :2018-10-30             Max.   :55.00  
##  (Other):39910                                                 
##       nuts      
##  Min.   : 0.00  
##  1st Qu.: 0.00  
##  Median : 6.00  
##  Mean   :17.49  
##  3rd Qu.:37.00  
##  Max.   :55.00  
##

head(all) # check the first 6 observations

## # A tibble: 6 x 5
##   plot  date       tree  total  nuts
##   <fct> <date>     <fct> <dbl> <dbl>
## 1 101   2018-08-31 A        51    51
## 2 101   2018-08-31 A        51    51
## 3 101   2018-08-31 A        51    51
## 4 101   2018-08-31 A        51    51
## 5 101   2018-08-31 A        51    51
## 6 101   2018-08-31 A        51    51

Notice that now our dataset have 48,152 observations. Much larger than our initial dataset!

Since now we have a row for each nut counted, we can start to add our binomial factor (0 or 1) as a new column.

First, we calculate the number of plots that no nut fell on each row and save the result as a vector called “nonuts”:

#add bionomial values to the dataset

nonuts <- data.tidy$total - data.tidy$nuts # calculate the number of nuts that felt

After that, we will create two datasets of lists containing the binomial value for each nut. To do that we will use the lapply() function to pass the function rep() on each row to add the value 1 on each row that contain a nut, and will do the same thing on the nonuts vector to create a dataset of lists of nuts that already fell by adding the value 0 to it:

results.1 <- lapply(data.tidy$nuts, rep, x = 1) # create a dataset of lists with the nuts that did not fell
results.0 <- lapply(nonuts, rep, x = 0) # create a dataset of lists with nuts that already fell

After this step we need to join this dataset lists together at the correct order; to do it we will create our own function called join.list() as above

join.list <- function(list.1,list.2, i) {
  c(list.1[[i]], list.2[[i]])
} # create a function that join each results from two lists

With this function, we can use the lapply function to join both datasets of lists into one that has the same length as our tidy.data and them unlist it to make it one single vector:

results <- lapply(1:936, join.list, list.1 = results.0, list.2 = results.1) # add here the number of observations in df "data.tidy"

results <- unlist(results) # unlist results to make a single vector

Now is time to add this vector as a column in our final dataset and finally get a binomial data. But remember, both of them need to have the same length to make sure of it by checking the length of the vector you just created:

length(results) == length(all$nuts) # compare the lengths of the dataset and vector

## [1] TRUE

If the result is TRUE you can correctly add this vector as a column into the all dataset that we created previously and create :

final.data <- all %>% mutate(nut = results) %>%
  mutate(nut.presence = if_else(nut == 1, TRUE, FALSE)) # add a T or F collumn and mutate results into all dataset

head(final.data)

## # A tibble: 6 x 7
##   plot  date       tree  total  nuts   nut nut.presence
##   <fct> <date>     <fct> <dbl> <dbl> <dbl> <lgl>       
## 1 101   2018-08-31 A        51    51     1 TRUE        
## 2 101   2018-08-31 A        51    51     1 TRUE        
## 3 101   2018-08-31 A        51    51     1 TRUE        
## 4 101   2018-08-31 A        51    51     1 TRUE        
## 5 101   2018-08-31 A        51    51     1 TRUE        
## 6 101   2018-08-31 A        51    51     1 TRUE

summary(final.data)

##       plot            date            tree          total      
##  309    : 1391   Min.   :2018-08-31   A:23855   Min.   :49.00  
##  109    : 1378   1st Qu.:2018-09-28   B:24297   1st Qu.:50.00  
##  403    : 1378   Median :2018-10-09             Median :51.00  
##  102    : 1365   Mean   :2018-10-05             Mean   :51.48  
##  104    : 1365   3rd Qu.:2018-10-19             3rd Qu.:52.00  
##  107    : 1365   Max.   :2018-10-30             Max.   :55.00  
##  (Other):39910                                                 
##       nuts            nut         nut.presence   
##  Min.   : 0.00   Min.   :0.0000   Mode :logical  
##  1st Qu.: 0.00   1st Qu.:0.0000   FALSE:31783    
##  Median : 6.00   Median :0.0000   TRUE :16369    
##  Mean   :17.49   Mean   :0.3399                  
##  3rd Qu.:37.00   3rd Qu.:1.0000                  
##  Max.   :55.00   Max.   :1.0000                  
##

write.csv(final.data, file = "nut.count.binomial.csv")

With that, you finally have your dataset as binomial and ready for a binomial logistic analysis!

Transforming your data into a binomial data

Lucas Kopecky Bobadilla

November 28, 2018