In my work on agricultural research, I always face times where I need to do a logistic regression analysis to understand my data and answer the questions that I need.
Some examples of those analyses can be a mortality dose-response analysis where my data refers to a dataset where there is usually a factor of two levels: “Dead” or “Alive” (sometimes another factor can be present such as “Damaged”). Another case is when you are counting the presence of seeds, nuts or even plants as your response to treatment.
When doing those kinds of experiments sometimes, you receive a spreadsheet containing just numbers such as “47 plants of 100 died on the plot 103” for instance as observed above. With this type of data, you cannot run a binomial logistic regression which sometimes may be your goal.
##
## ---------------------------------
## Treatment plot died total
## ----------- ------ ------ -------
## 1 101 0 100
##
## 2 102 25 100
##
## 3 103 47 100
##
## 4 104 99 100
##
## 3 201 67 100
##
## 4 202 100 100
##
## 2 203 34 100
##
## 1 204 0 100
##
## 2 301 29 100
##
## 3 302 62 100
##
## 1 303 0 100
##
## 4 304 95 100
## ---------------------------------
Transforming this data by hand can be a challenge depending on the size of your dataset and the number of observations that you have. With that in mind, I will demonstrate how to transform this data into binomial data.
For doing this, I will use part of the dataset from one of my fellow graduate students Larissa Larocca. This dataset refers to the effects of some herbicides for Sucker control in hazelnuts on nut fall timing.
Lets first give a look at the data structure using glimpse() and head() and open the necessary packages
library(readxl) # this package read excel files
library(tidyverse) # this package open all the tidyverse packages
library(lubridate) # date transformation package
data <- read_excel("Amity-Nuts-2018.xlsx")
head(data) # open a data first 6 lines
## # A tibble: 6 x 6
## plot date total.A total.B A B
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 101 2018.08.31 51 52 51 52
## 2 102 2018.08.31 54 51 54 51
## 3 103 2018.08.31 51 52 51 52
## 4 104 2018.08.31 52 53 52 53
## 5 105 2018.08.31 51 50 51 50
## 6 106 2018.08.31 52 52 52 52
On this data you have the variables:
First thing this data needs to be set into a tidy version before the transformation:
x <- data %>%
gather(key = tree, value = nuts, A:B) # create collumn tree
y <- data %>%
gather(key = tree, value = total, total.A:total.B) %>%
select(-c(A,B)) %>%
separate(tree, c("discart", "tree")) %>%
select(-discart) %>%
full_join(x) %>%
select(-c(total.A,total.B)) # create a column for initial total number and join with x
y$tree <- factor(y$tree) #make tree a factor
data.tidy <- y
summary(data.tidy) # check data summary
## plot date tree total nuts
## Min. :101 Length:936 A:468 Min. :49.00 Min. : 0.00
## 1st Qu.:178 Class :character B:468 1st Qu.:50.00 1st Qu.: 0.00
## Median :255 Mode :character Median :51.00 Median : 6.00
## Mean :255 Mean :51.44 Mean :17.49
## 3rd Qu.:332 3rd Qu.:52.00 3rd Qu.:37.25
## Max. :409 Max. :55.00 Max. :55.00
head(data.tidy)
## # A tibble: 6 x 5
## plot date tree total nuts
## <dbl> <chr> <fct> <dbl> <dbl>
## 1 101 2018.08.31 A 51 51
## 2 102 2018.08.31 A 54 54
## 3 103 2018.08.31 A 51 51
## 4 104 2018.08.31 A 52 52
## 5 105 2018.08.31 A 51 51
## 6 106 2018.08.31 A 52 52
On the code above we create a column for “Tree” and join together the A and B values into nuts and the totals into one column. This results in a dataset containing 936 observations.
After tidying the data, we need to separate this in two datasets for each tree by using the filter() function. After this, we will need to create an observation for each nut of each plot by using the rep() function which takes as the first argument what you want to repeat and as the second argument how many times you want to repeat it.
tree.A <- data.tidy %>%
filter(tree == "A") #filter the data for tree A
tree.A <- tree.A[rep(row.names(tree.A), tree.A$total), 1:5] # create a row for each total nut on each plot
cols <- c("plot", "tree") #select the columns to transform to factor
tree.A[cols] <- lapply(tree.A[cols], factor) # use lapply to transform columns to factor
summary(tree.A) # check the summary
## plot date tree total nuts
## 102 : 702 Length:23855 A:23855 Min. :49 Min. : 0.00
## 109 : 689 Class :character 1st Qu.:50 1st Qu.: 0.00
## 408 : 689 Mode :character Median :51 Median : 6.00
## 104 : 676 Mean :51 Mean :17.11
## 106 : 676 3rd Qu.:52 3rd Qu.:39.00
## 201 : 676 Max. :54 Max. :54.00
## (Other):19747
tree.B <- data.tidy %>%
filter(tree == "B") # filter the data for tree B
tree.B <- tree.B[rep(row.names(tree.B), tree.B$total), 1:5] # create a row for each total nut on each plot
tree.B[cols] <- lapply(tree.B[cols], factor) # use lapply to transform columns to factor
summary(tree.B) # check the summary
## plot date tree total
## 107 : 715 Length:24297 B:24297 Min. :50.00
## 309 : 715 Class :character 1st Qu.:51.00
## 403 : 702 Mode :character Median :52.00
## 104 : 689 Mean :51.95
## 109 : 689 3rd Qu.:53.00
## 201 : 689 Max. :55.00
## (Other):20098
## nuts
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 8.00
## Mean :17.86
## 3rd Qu.:37.00
## Max. :55.00
##
After creating a line for each nut on each tree dataset, we need to bind those two datasets together again. For doing this, we can use the function full_join() from the Dplyr package:
#join both tree data
all <- full_join(tree.A,tree.B) # join the two dataset together
all$tree <- factor(all$tree) # transform the column tree back to factor
all$date <- ymd(all$date) %>% force_tz(tzone = "PST") # make the date column into date format at the PST time zone
summary(all) # check the summary
## plot date tree total
## 309 : 1391 Min. :2018-08-31 A:23855 Min. :49.00
## 109 : 1378 1st Qu.:2018-09-28 B:24297 1st Qu.:50.00
## 403 : 1378 Median :2018-10-09 Median :51.00
## 102 : 1365 Mean :2018-10-05 Mean :51.48
## 104 : 1365 3rd Qu.:2018-10-19 3rd Qu.:52.00
## 107 : 1365 Max. :2018-10-30 Max. :55.00
## (Other):39910
## nuts
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 6.00
## Mean :17.49
## 3rd Qu.:37.00
## Max. :55.00
##
head(all) # check the first 6 observations
## # A tibble: 6 x 5
## plot date tree total nuts
## <fct> <date> <fct> <dbl> <dbl>
## 1 101 2018-08-31 A 51 51
## 2 101 2018-08-31 A 51 51
## 3 101 2018-08-31 A 51 51
## 4 101 2018-08-31 A 51 51
## 5 101 2018-08-31 A 51 51
## 6 101 2018-08-31 A 51 51
Notice that now our dataset have 48,152 observations. Much larger than our initial dataset!
Since now we have a row for each nut counted, we can start to add our binomial factor (0 or 1) as a new column.
First, we calculate the number of plots that no nut fell on each row and save the result as a vector called “nonuts”:
#add bionomial values to the dataset
nonuts <- data.tidy$total - data.tidy$nuts # calculate the number of nuts that felt
After that, we will create two datasets of lists containing the binomial value for each nut. To do that we will use the lapply() function to pass the function rep() on each row to add the value 1 on each row that contain a nut, and will do the same thing on the nonuts vector to create a dataset of lists of nuts that already fell by adding the value 0 to it:
results.1 <- lapply(data.tidy$nuts, rep, x = 1) # create a dataset of lists with the nuts that did not fell
results.0 <- lapply(nonuts, rep, x = 0) # create a dataset of lists with nuts that already fell
After this step we need to join this dataset lists together at the correct order; to do it we will create our own function called join.list() as above
join.list <- function(list.1,list.2, i) {
c(list.1[[i]], list.2[[i]])
} # create a function that join each results from two lists
With this function, we can use the lapply function to join both datasets of lists into one that has the same length as our tidy.data and them unlist it to make it one single vector:
results <- lapply(1:936, join.list, list.1 = results.0, list.2 = results.1) # add here the number of observations in df "data.tidy"
results <- unlist(results) # unlist results to make a single vector
Now is time to add this vector as a column in our final dataset and finally get a binomial data. But remember, both of them need to have the same length to make sure of it by checking the length of the vector you just created:
length(results) == length(all$nuts) # compare the lengths of the dataset and vector
## [1] TRUE
If the result is TRUE you can correctly add this vector as a column into the all dataset that we created previously and create :
final.data <- all %>% mutate(nut = results) %>%
mutate(nut.presence = if_else(nut == 1, TRUE, FALSE)) # add a T or F collumn and mutate results into all dataset
head(final.data)
## # A tibble: 6 x 7
## plot date tree total nuts nut nut.presence
## <fct> <date> <fct> <dbl> <dbl> <dbl> <lgl>
## 1 101 2018-08-31 A 51 51 1 TRUE
## 2 101 2018-08-31 A 51 51 1 TRUE
## 3 101 2018-08-31 A 51 51 1 TRUE
## 4 101 2018-08-31 A 51 51 1 TRUE
## 5 101 2018-08-31 A 51 51 1 TRUE
## 6 101 2018-08-31 A 51 51 1 TRUE
summary(final.data)
## plot date tree total
## 309 : 1391 Min. :2018-08-31 A:23855 Min. :49.00
## 109 : 1378 1st Qu.:2018-09-28 B:24297 1st Qu.:50.00
## 403 : 1378 Median :2018-10-09 Median :51.00
## 102 : 1365 Mean :2018-10-05 Mean :51.48
## 104 : 1365 3rd Qu.:2018-10-19 3rd Qu.:52.00
## 107 : 1365 Max. :2018-10-30 Max. :55.00
## (Other):39910
## nuts nut nut.presence
## Min. : 0.00 Min. :0.0000 Mode :logical
## 1st Qu.: 0.00 1st Qu.:0.0000 FALSE:31783
## Median : 6.00 Median :0.0000 TRUE :16369
## Mean :17.49 Mean :0.3399
## 3rd Qu.:37.00 3rd Qu.:1.0000
## Max. :55.00 Max. :1.0000
##
write.csv(final.data, file = "nut.count.binomial.csv")
With that, you finally have your dataset as binomial and ready for a binomial logistic analysis!