Data organization

Raquel Divieso

25/11/2021

This document describes how the trait data related to rarity in mammal species were organized. These data are the source for the analyses in “Exploring phylogenetic and spatial patterns of rarity and their correlates”.

Let’s start by opening and organizing the COMBINE data from Soria et al. (2021):

This data set contains the most recent and complete compilation of global mammal traits from published sources. The data can be downloaded here.

The traits in this data set that are useful for determining rarity in mammals are:

  • adult_mass_g: adult body mass; body size helps in analysing and interpreting abundance data;
  • litter_size_n: number of offspring born per litter;
  • litters_per_year_n: number of litters per year;
  • interbirth_interval_d: number of days between births;
  • density_n_km2: number of individuals of the species per square kilometer;
  • det_diet_breadth_n: breadth of the EltonTraits dietary categories;
  • trophic_level: classification as herbivore, omnivore or carnivore;
  • habitat_breadth_n: number of distinct suitable level 1 IUCN habitats.
library(tidyverse)
library(ggplot2)
library(tidyr)
library(rpubs)


setwd("C:/Users/raque/OneDrive - ufpr.br/Doutorado/3ºCap/data_s")

comb <- read.csv("trait_data_imputed.csv")
# Keep the taxonomy columns and the traits listed above
comb <- select(comb, order, genus, species, family, iucn2020_binomial,
               adult_mass_g, litter_size_n, litters_per_year_n,
               interbirth_interval_d, density_n_km2, det_diet_breadth_n,
               trophic_level, habitat_breadth_n)

Note that we will use the IUCN 2020 taxonomy, because the range sizes will be obtained from that database. Let’s see whether any species has duplicated records:

anyDuplicated(comb$iucn2020_binomial)
## [1] 87
# A non-zero result (the index of the first duplicated name) means there are duplicates. Let's see what's happening:
comb1<-data.frame(comb %>% 
  group_by(iucn2020_binomial) %>%
  filter(n()>1)) 
# Many species are "Not recognised" by the IUCN. We save them to a separate file to check one by one later, and remove them from the main dataset for now:

write.csv(comb %>% filter(iucn2020_binomial == "Not recognised"), "Not_recognised_spp_COMBINE.csv")
comb <- comb %>% filter(iucn2020_binomial != "Not recognised")
#done
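
Note that `anyDuplicated()` reports the position of the first duplicated entry, not the number of duplicates. As a minimal sketch on a made-up vector (the names below are illustrative only, not taken from COMBINE), the duplicated entries can be counted and listed like this:

```r
# Toy vector of binomials; the values are hypothetical
x <- c("Panthera leo", "Not recognised", "Lynx lynx",
       "Not recognised", "Lynx lynx", "Not recognised")

first_dup <- anyDuplicated(x)           # index of the first repeated entry (0 if none)
n_extra   <- sum(duplicated(x))         # number of rows that repeat an earlier name
dup_names <- unique(x[duplicated(x)])   # names that occur more than once
dup_table <- table(x)[dup_names]        # number of rows for each duplicated name

print(first_dup)
print(n_extra)
print(dup_table)
```

The same calls can be applied to `comb$iucn2020_binomial` to see how many rows are involved before deciding how to handle them.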

Now we can look only at the species with more than one row and check whether the trait values differ between rows with the same name:

comb2<-data.frame(comb %>% 
                    group_by(iucn2020_binomial) %>%
                    filter(n()>1))

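# One row per column of comb2: trait name and whether duplicated rows disagree
dif_values <- matrix(ncol = 2, nrow = 13)
colnames(dif_values) <- c("trait", "is_diferent")
# Difference between the first two rows of a species (an NA counts as 0)
minus <- function(x) sum(x[1], na.rm = TRUE) - sum(x[2], na.rm = TRUE)
for (i in 6:13) {
  d <- aggregate(comb2[i], by = list(comb2$iucn2020_binomial), FUN = minus)
  dif_values[i, 1] <- colnames(comb2[i])
  dif_values[i, 2] <- any(d[2] != 0)  # != 0 also catches negative differences
}
dif_values <- dif_values[-1:-5, ]  # drop the unused rows of the taxonomy columns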

These three traits have different values among rows of the same species:
- litter_size_n
- litters_per_year_n
- interbirth_interval_d

comb3<-comb2  %>% select(iucn2020_binomial, litter_size_n, litters_per_year_n, interbirth_interval_d)
# write.csv(comb3, "different_values.csv")
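
The check above compares only the first two rows of each species. As a hedged alternative, the sketch below measures the spread (maximum minus minimum) of each trait within a species, which also works when a name has more than two rows. The toy table and the helper `spread()` are hypothetical and are not part of the original analysis:

```r
# Toy data; species names and values are hypothetical
toy <- data.frame(
  iucn2020_binomial = c("A a", "A a", "B b", "B b", "B b"),
  litter_size_n     = c(2.00, 2.01, 3, 3, 3),
  adult_mass_g      = c(10, 10, 50, 50, 50)
)

# Spread of a trait within one species, ignoring NAs
spread <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 2) return(0)
  max(x) - min(x)
}

traits <- c("litter_size_n", "adult_mass_g")

# TRUE when at least one species has rows that disagree for that trait
is_different <- sapply(traits, function(tr) {
  any(tapply(toy[[tr]], toy$iucn2020_binomial, spread) > 0)
})
print(is_different)
```

In this toy example only `litter_size_n` is flagged, because species "A a" has two slightly different values for it.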

Looking at the values in the table, the differences are very small (in the second or third decimal place), so the simplest solution is to average the values of each trait across the rows of the same species (22 spp.).

#Unique data
comb_u<-data.frame(comb %>% 
                     group_by(iucn2020_binomial) %>%
                     filter(!n()>1))

#Duplicated data
comb_d<-data.frame(comb %>% 
                     group_by(iucn2020_binomial) %>%
                     filter(n()>1))

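# Empty table with one row per duplicated species
comb_d2 <- as.data.frame(matrix(ncol = dim(comb_d)[2], nrow = length(unique(comb_d$iucn2020_binomial))))

# Average each trait (columns 6 to 13) within species
for (i in 6:13) {
  v <- aggregate(comb_d[i], by = list(comb_d$iucn2020_binomial), FUN = mean)
  comb_d2[, i] <- v[[2]]
}

# Taxonomy columns: one row per species, in the same order as the aggregate() output
nam <- select(comb_d, order, genus, species, family, iucn2020_binomial)
nam <- nam %>% distinct(iucn2020_binomial, .keep_all = TRUE)
comb_d2[1:5] <- nam[match(v[[1]], nam$iucn2020_binomial), 1:5]

colnames(comb_d2) <- colnames(comb_u)
comb <- rbind(comb_u, comb_d2)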

# write.csv(comb, "COMBINE_without_dup.csv")
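
The averaging step can also be written as a single grouped summary. The sketch below uses a hypothetical toy table and assumes that missing values should be ignored when averaging (`na.rm = TRUE`), which differs from the `mean()` call above. It is an illustration with `dplyr`, not the procedure used to produce the results reported here:

```r
library(dplyr)

# Toy data; names and values are hypothetical
toy <- data.frame(
  genus             = c("A", "A", "B"),
  iucn2020_binomial = c("A a", "A a", "B b"),
  litter_size_n     = c(2.00, 2.02, 3),
  density_n_km2     = c(NA, 4, 1)
)

collapsed <- toy %>%
  group_by(iucn2020_binomial) %>%
  summarise(
    genus = first(genus),                                 # keep one taxonomy label per species
    across(where(is.numeric), ~ mean(.x, na.rm = TRUE)),  # average the numeric traits
    .groups = "drop"
  )
print(collapsed)
```

With `na.rm = TRUE`, a species whose rows are all `NA` for a trait gets `NaN` rather than `NA`, so such cases may need to be converted back before counting missing values.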

Let’s confirm that no duplicated names remain:

any(duplicated(comb$iucn2020_binomial)) 
## [1] FALSE
#it's ok!

Now, with the clean dataset, we can see how many species have data for each trait and how many species have information for all traits:

# Number of species with complete data for all traits:
dim(na.omit(comb))[1]
## [1] 1226
# Number of species with data for each trait:
data.frame(colSums(!is.na(comb)))
##                       colSums..is.na.comb..
## order                                  5961
## genus                                  5961
## species                                5961
## family                                 5961
## iucn2020_binomial                      5961
## adult_mass_g                           5744
## litter_size_n                          5808
## litters_per_year_n                     5806
## interbirth_interval_d                  5803
## density_n_km2                          1249
## det_diet_breadth_n                     5794
## trophic_level                          5810
## habitat_breadth_n                      5631
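
The counts above can also be expressed as percentages of missing values, which makes the traits easier to compare. A minimal sketch on a hypothetical toy table (the values are made up, and the `coverage` table is not part of the original analysis):

```r
# Toy table; values are hypothetical
toy <- data.frame(
  adult_mass_g  = c(10, 20, NA, 40),
  density_n_km2 = c(NA, NA, NA, 2),
  trophic_level = c(1, 2, 3, 1)
)

coverage <- data.frame(
  n_species   = colSums(!is.na(toy)),                # species with a value
  pct_missing = round(100 * colMeans(is.na(toy)), 1) # percentage of NAs
)

# Traits with the most missing data first
coverage <- coverage[order(-coverage$pct_missing), ]
print(coverage)
```

Replacing `toy` with `comb` would give the same kind of table for the real data.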

The trait with the most missing information is density_n_km2 (available for only 1249 of the 5961 species, about 21%), which is one of the most important for our study. We therefore need to actively search other databases for this trait [continue].