This document describes how the trait data related to rarity in mammal species were organized. These data are the source for the analyses in “Exploring phylogenetic and spatial patterns of rarity and their correlates”.
Let’s start by opening and organizing the COMBINE data from Soria et al., 2021:
This data set contains the most recent and complete compilation of global mammal traits from published sources. The data can be downloaded here.
The traits in this data set that are useful for determining rarity in mammals are:
- adult_mass_g: adult body mass, useful for analysing and interpreting abundance data;
- litter_size_n: number of offspring born per litter;
- litters_per_year_n: number of litters per year;
- interbirth_interval_d: days between births;
- density_n_km2: number of individuals of the species per square kilometer;
- det_diet_breadth_n: breadth across the EltonTraits dietary categories;
- trophic_level: classification as herbivore, omnivore or carnivore;
- habitat_breadth_n: number of distinct suitable level 1 IUCN habitats.
library(tidyverse)
library(ggplot2)
library(tidyr)
library(rpubs)
setwd("C:/Users/raque/OneDrive - ufpr.br/Doutorado/3ºCap/data_s")
comb<- read.csv("trait_data_imputed.csv")
comb<- select(comb, order, genus, species, family, iucn2020_binomial, adult_mass_g, litter_size_n, litters_per_year_n, interbirth_interval_d, density_n_km2, det_diet_breadth_n, trophic_level, habitat_breadth_n)
Note that we use the IUCN 2020 taxonomy, because the range sizes will be obtained from that database. Let’s check whether any species has duplicated records:
## [1] 87
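The code that produced the count above is not shown in the report. As a minimal sketch, assuming the check is made on `iucn2020_binomial`, repeated names can be counted as below. The data frame `toy` and its values are hypothetical, not COMBINE data, and whether the reported number counts repeats or rows is not stated in the source.

```r
# Hypothetical toy data: two rows share the same IUCN binomial
toy <- data.frame(
  iucn2020_binomial = c("Aus bus", "Aus bus", "Cus dus", "Not recognised"),
  adult_mass_g = c(10, 10, 250, 40)
)

# duplicated() flags every repeat after the first occurrence of a name
n_repeats <- sum(duplicated(toy$iucn2020_binomial))

# Names that occur in more than one row
dup_names <- unique(toy$iucn2020_binomial[duplicated(toy$iucn2020_binomial)])

n_repeats
dup_names
```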
comb1<-data.frame(comb %>%
group_by(iucn2020_binomial) %>%
filter(n()>1))
# Many species are "Not recognised" by the IUCN. They are saved to a separate file to be checked one by one later, and removed from the main dataset for now:
write.csv(comb %>% filter(iucn2020_binomial=="Not recognised"), "Not_recognised_spp_COMBINE.csv")
comb<-comb %>% filter(!iucn2020_binomial=="Not recognised")
#done
Now we can look only at the species with more than one row and check whether any trait value differs between the duplicated names:
comb2<-data.frame(comb %>%
group_by(iucn2020_binomial) %>%
filter(n()>1))
dif_values<-matrix(ncol= 2, nrow= 13)
colnames(dif_values)<- c("trait", "is_diferent")
minus <- function(x) sum(x[1],na.rm=T) - sum(x[2],na.rm=T)
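for(i in 6:13){
# difference between the first two rows of each species, for trait i
t<-aggregate(comb2[i], by=list(comb2$iucn2020_binomial), FUN = minus)
dif_values[i,1]<- colnames(comb2[i])
# minus() can return a negative value, so test for any non-zero difference
dif_values[i,2]<- any(t[2]!=0)
}
dif_values<- dif_values[-1:-5,]
These three traits differ in value between rows of the same species: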
- litter_size_n
- litters_per_year_n
- interbirth_interval_d
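The loop above compares only the first two rows of each species. As a hedged alternative, the within-species spread (maximum minus minimum) covers any number of rows and ignores the sign of the difference. The sketch below uses hypothetical toy data, and the helper `spread` is an illustrative name, not part of the original analysis.

```r
# Hypothetical toy data: one species with two rows, one with three
toy <- data.frame(
  iucn2020_binomial = c("Aus bus", "Aus bus", "Cus dus", "Cus dus", "Cus dus"),
  litter_size_n = c(2.00, 2.01, 4, 4, 4),
  adult_mass_g = c(10, 10, 250, 250, 250)
)

# Spread of a trait within one species; 0 when no value is available
spread <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(0)
  max(x) - min(x)
}

traits <- c("litter_size_n", "adult_mass_g")

# TRUE for each trait where at least one species has differing values
is_different <- sapply(traits, function(tr) {
  any(tapply(toy[[tr]], toy$iucn2020_binomial, spread) > 0)
})

is_different
```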
comb3<-comb2 %>% select(iucn2020_binomial, litter_size_n, litters_per_year_n, interbirth_interval_d)
# write.csv(comb3, "different_values.csv")
Looking at the values in the table, we can see that the differences are very small (in the second or third decimal place), so the simplest solution is to average each trait across the rows of the same species (22 spp).
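The averaging idea can be illustrated on a small example before applying it to the full table. This is a minimal sketch on hypothetical toy data. The use of `na.rm = TRUE` is an assumption added here so that a missing value in one row does not erase the value from the other row.

```r
# Hypothetical toy data: "Aus bus" appears twice with slightly different values
toy <- data.frame(
  iucn2020_binomial = c("Aus bus", "Aus bus", "Cus dus"),
  litter_size_n = c(2.00, 2.02, 4),
  adult_mass_g = c(10, 10, 250)
)

traits <- c("litter_size_n", "adult_mass_g")

# One row per species, each trait averaged over that species' rows
avg <- aggregate(
  toy[traits],
  by = list(iucn2020_binomial = toy$iucn2020_binomial),
  FUN = mean,
  na.rm = TRUE  # assumption: ignore missing values when averaging
)

avg
```

Species with a single row pass through unchanged, so in this sketch there is no need to split the data into unique and duplicated parts first.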
#Unique data
comb_u<-data.frame(comb %>%
group_by(iucn2020_binomial) %>%
filter(!n()>1))
#Duplicated data
comb_d<-data.frame(comb %>%
group_by(iucn2020_binomial) %>%
filter(n()>1))
comb_d2<-as.data.frame(matrix(ncol= dim(comb_d)[2], nrow= length(unique(comb_d$iucn2020_binomial))))
for(i in 6:13){
v<-aggregate(comb_d[i], by=list(comb_d$iucn2020_binomial), FUN = mean)
comb_d2[,i]<- v[2]}
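# species names in the row order returned by aggregate()
sp_order<- v[[1]]
nam<-select(comb_d, order, genus, species, family, iucn2020_binomial)
# keep one taxonomy row per species, aligned with the averaged rows
nam<- nam[match(sp_order, nam$iucn2020_binomial),]
comb_d2[1:5]<- nam[1:5]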
colnames(comb_d2)<-colnames(comb_u)
comb<-rbind(comb_u,comb_d2)
# write.csv(comb, "COMBINE_without_dup.csv")
Let’s confirm that no duplicated names remain:
## [1] FALSE
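The check behind this output is not shown. One plausible form, sketched here on a hypothetical toy data frame, asks whether any name occurs more than once:

```r
# Hypothetical cleaned data: every IUCN binomial appears once
toy_clean <- data.frame(
  iucn2020_binomial = c("Aus bus", "Cus dus", "Eus fus")
)

# FALSE means no name is repeated
has_dups <- any(duplicated(toy_clean$iucn2020_binomial))

has_dups
```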
Now, with the clean dataset, we can see how many records we have for each trait and for how many species we have information on all traits:
## [1] 1226
## colSums..is.na.comb..
## order 5961
## genus 5961
## species 5961
## family 5961
## iucn2020_binomial 5961
## adult_mass_g 5744
## litter_size_n 5808
## litters_per_year_n 5806
## interbirth_interval_d 5803
## density_n_km2 1249
## det_diet_breadth_n 5794
## trophic_level 5810
## habitat_breadth_n 5631
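The code for these two summaries is hidden in the report. The column header of the table suggests a call of the form `colSums(!is.na(comb))`, and the single number is consistent with a count of complete rows, although that is an assumption. A minimal sketch on hypothetical toy data:

```r
# Hypothetical toy data with some missing trait values
toy <- data.frame(
  iucn2020_binomial = c("Aus bus", "Cus dus", "Eus fus"),
  adult_mass_g = c(10, 250, NA),
  density_n_km2 = c(NA, 3.5, NA)
)

# Number of species with a value in every column
n_complete <- sum(complete.cases(toy))

# Number of non-missing values per column
n_available <- colSums(!is.na(toy))

n_complete
n_available
```

Read this way, the table reports available values rather than missing ones, so the smallest count marks the trait with the most gaps.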
The trait with the most missing information is density_n_km2, one of the most important for our study. So we need to actively search other databases for this trait [continue].