Grouping Functions in BioDiv

Beginners

The GBIF data is huge and complex. The long version has around 235 fields and the short version around 110. But as you may have experienced, most of these fields rarely matter: usually only 5 to 10 of them are important. Further, the fields are jumbled together in no sensible order, and they are not divided into categories.

The following set of functions tries to eliminate that issue.

Let’s start by working through some use cases we might come across. I have a data set called australianMammals (Australian mammal data with 235 columns and 1758193 rows). Let’s first try to categorize these columns into the common groups as classified by GBIF.

# Note: What are these groups classified by GBIF?
    # 1) taxonClass - Taxonomic names, taxon name usages, or taxon concepts.
    # 2) locationClass - A spatial region or named place. Terms describing a place, whether named or not.
    # 3) eventClass - Information of an event (an action that occurs at a place and during a period of time)
    # 4) occurenceClass - Evidence of an occurrence in nature, or in a collection (specimen, observation, etc.)
    # 5) recordlevelTermsClass - Terms apply to the whole record regardless of the record type.
    # 6) geologicalContextClass - Information of a location within a geological context, such as stratigraphy.
    # 7) identificationClass - Taxonomic determinations (the assignment of a scientific name)
    # 8) resourceRelationshipClass
    # 9) measurementOrFactClass
    
# In other words, we can say,
    # taxonClass is taxonomical data
    # locationClass is spatial data
    # eventClass is temporal data
    
# For the full list of fields in each group, refer to the appendix
  1. So let’s do it,
allGroups <- addPartition(grouping = "all")
results <- applyPartitions(allGroups, australianMammals)

That’s it, as simple as that. Let’s see how the results are created.

In the first line, we create a partition tree. This tree decides how the final partitioning is done. Let’s see what the partition tree looks like.

print(allGroups)
##                        levelName
## 1  Partition Tree               
## 2   ¦--taxonClass               
## 3   ¦--locationClass            
## 4   ¦--eventClass               
## 5   ¦--occurenceClass           
## 6   ¦--recordlevelTermsClass    
## 7   ¦--geologicalContextClass   
## 8   ¦--identificationClass      
## 9   ¦--resourceRelationshipClass
## 10  °--measurementOrFactClass

So, the original data is divided into 10 partitions: the root, which holds all the columns, and one for each of the nine classes. In the second line, we apply that partition tree to the data.

summary(results)
##                                          Length Class      Mode
## Partition Tree                           235    data.frame list
## Partition Tree/taxonClass                 29    data.frame list
## Partition Tree/locationClass              35    data.frame list
## Partition Tree/eventClass                 15    data.frame list
## Partition Tree/occurenceClass             20    data.frame list
## Partition Tree/recordlevelTermsClass      11    data.frame list
## Partition Tree/geologicalContextClass     18    data.frame list
## Partition Tree/identificationClass         8    data.frame list
## Partition Tree/resourceRelationshipClass   0    data.frame list
## Partition Tree/measurementOrFactClass      0    data.frame list

Here you can see that the result is a list of 10 dataframes, and that the dataframes have different numbers of fields. Let’s see what the fields in eventClass are.

names(results$'Partition Tree/eventClass')
##  [1] "eventID"           "samplingProtocol"  "samplingEffort"   
##  [4] "eventDate"         "eventTime"         "startDayOfYear"   
##  [7] "endDayOfYear"      "year"              "month"            
## [10] "day"               "verbatimEventDate" "habitat"          
## [13] "fieldNumber"       "fieldNotes"        "eventRemarks"

Here, all the fields related to temporal aspects are grouped together, which is what we need.
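
To make the idea concrete, the column grouping can be approximated in base R. This is only a sketch, not the package's implementation: the fieldGroups lookup, the toy data and the splitByFieldGroup helper are hypothetical, and the lookup lists just a few fields per class.

```r
# Hypothetical lookup of field groups (a tiny subset, for illustration only)
fieldGroups <- list(
  taxonClass    = c("scientificName", "kingdom", "family"),
  locationClass = c("country", "decimalLatitude", "decimalLongitude"),
  eventClass    = c("eventDate", "year", "month")
)

# Toy stand-in for the real data, with the columns in no particular order
toy <- data.frame(
  year            = c(2001L, 2005L),
  scientificName  = c("Vulpes Frisch, 1775", "Vombatus ursinus (Shaw, 1800)"),
  decimalLatitude = c(-33.9, -37.8),
  family          = c("Canidae", "Vombatidae"),
  catalogNumber   = c("A1", "B2"),
  stringsAsFactors = FALSE
)

# For each group, keep only the fields that are actually present in the data
splitByFieldGroup <- function(data, groups) {
  lapply(groups, function(fields) {
    data[, intersect(fields, names(data)), drop = FALSE]
  })
}

parts <- splitByFieldGroup(toy, fieldGroups)
names(parts$taxonClass)
```

Each element of parts is a dataframe with the same rows as the input but only the columns of one class.
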

  1. Now, if you don’t need a list of dataframes, but just a single dataframe with related fields collected together, use the return parameter:
results <- applyPartitions(allGroups, australianMammals, return = "dataframe")
dim(results)
## [1]   6 136
head(names(results), 30)
##  [1] "taxonID"              "scientificNameID"     "acceptedNameUsageID" 
##  [4] "parentNameUsageID"    "originalNameUsageID"  "nameAccordingToID"   
##  [7] "namePublishedInID"    "taxonConceptID"       "scientificName"      
## [10] "acceptedNameUsage"    "parentNameUsage"      "originalNameUsage"   
## [13] "nameAccordingTo"      "namePublishedIn"      "namePublishedInYear" 
## [16] "higherClassification" "kingdom"              "phylum"              
## [19] "class"                "order"                "family"              
## [22] "genus"                "subgenus"             "specificEpithet"     
## [25] "infraspecificEpithet" "taxonRank"            "verbatimTaxonRank"   
## [28] "vernacularName"       "nomenclaturalCode"    "locationID"

If you look closely, all related fields are grouped together: the first 29 belong to taxonClass, followed by those of locationClass. The important thing is that you get a single dataframe whose columns are ordered by group.
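
The same effect can be imitated in base R by reordering the columns according to a group lookup. The sketch below rests on assumptions: fieldGroups and toy are hypothetical, and the package may build its result differently.

```r
# Hypothetical lookup of field groups (a tiny subset, for illustration only)
fieldGroups <- list(
  taxonClass    = c("scientificName", "kingdom", "family"),
  locationClass = c("country", "decimalLatitude", "decimalLongitude"),
  eventClass    = c("eventDate", "year", "month")
)

# Toy data with jumbled columns and one column that belongs to no listed group
toy <- data.frame(
  year            = c(2001L, 2005L),
  scientificName  = c("Vulpes Frisch, 1775", "Vombatus ursinus (Shaw, 1800)"),
  decimalLatitude = c(-33.9, -37.8),
  family          = c("Canidae", "Vombatidae"),
  catalogNumber   = c("A1", "B2"),
  stringsAsFactors = FALSE
)

# Flatten the groups in order, then keep only the fields present in the data
orderedFields <- intersect(unlist(fieldGroups, use.names = FALSE), names(toy))

# One dataframe, columns arranged group by group; ungrouped columns are dropped
regrouped <- toy[, orderedFields, drop = FALSE]
names(regrouped)
```

In this sketch, columns outside the listed groups (catalogNumber here) do not appear in the result.
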

  1. Now, say you don’t need all these fields. You are just interested in the main aspects of biodiversity data (spatial, temporal and taxonomic)
primaryGroups <- addPartition(grouping = "primary3")
results <- applyPartitions(primaryGroups, australianMammals)
summary(results)
##                              Length Class      Mode
## Partition Tree               235    data.frame list
## Partition Tree/taxonClass     29    data.frame list
## Partition Tree/locationClass  35    data.frame list
## Partition Tree/eventClass     15    data.frame list

So this returns only the fields related to these classes.

  1. Now, if you want to single out just one particular class (let’s say taxon),
taxonGroup <- addPartition(grouping = "taxonClass")
print(taxonGroup)
##        levelName
## 1 Partition Tree
## 2  °--taxonClass
results <- applyPartitions(taxonGroup, australianMammals, return = "dataframe")
dim(results)
## [1]  6 29

Intermediates

The section above was entirely about partitioning by columns. You can also partition by rows, or even by quality checks.

  1. For example, if you want the data partitioned on the value of a column (say you want the records of Vulpes Frisch, 1775),
groups <- addPartition(grouping = "column", column = "scientificName~Vulpes Frisch, 1775", data = australianMammals)

Note that the parameters have changed a lot. When you partition on a column, the grouping parameter should be "column". The column parameter takes the column name and the value, separated by a ~; here it is "scientificName~Vulpes Frisch, 1775". The dataframe must be supplied too. Let’s see what the partition looks like.

groups
##                                 levelName
## 1 Partition Tree                         
## 2  ¦--scientificName=Vulpes Frisch, 1775 
## 3  °--scientificName!=Vulpes Frisch, 1775
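
A two-way split like this one can be sketched in base R as follows. The parseColumnSpec helper and the toy data are hypothetical, and the package's own handling of the "column~value" string may differ.

```r
# Toy stand-in for the real data
toy <- data.frame(
  scientificName = c("Vulpes Frisch, 1775",
                     "Vombatus ursinus (Shaw, 1800)",
                     "Vulpes Frisch, 1775"),
  year = c(2001L, 2005L, 2010L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split "column~value" into its two parts
parseColumnSpec <- function(spec) {
  pieces <- strsplit(spec, "~", fixed = TRUE)[[1]]
  list(
    column = pieces[1],
    value  = if (length(pieces) > 1) pieces[2] else NA_character_
  )
}

spec <- parseColumnSpec("scientificName~Vulpes Frisch, 1775")

# Rows with a missing value count as "not equal"
columnValues <- toy[[spec$column]]
matches <- !is.na(columnValues) & columnValues == spec$value

parts <- list(toy[matches, , drop = FALSE], toy[!matches, , drop = FALSE])
names(parts) <- paste0(spec$column, c("=", "!="), spec$value)
names(parts)
```
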
  1. Another interesting thing to note: if you leave the part after the ~ empty (as in "scientificName~"), the data is partitioned by every unique value of scientificName.
groups <- addPartition(grouping = "column", column = "scientificName~", data = australianMammals)
groups
##                                    levelName
## 1 Partition Tree                            
## 2  ¦--Oryctolagus cuniculus (Linnaeus, 1758)
## 3  ¦--Tachyglossus aculeatus (Shaw, 1792)   
## 4  ¦--Vombatus ursinus (Shaw, 1800)         
## 5  ¦--Vulpes Frisch, 1775                   
## 6  °--Macropus giganteus Shaw, 1790

Exciting huh?
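
For comparison, base R offers the same kind of split through split(). The toy data below is hypothetical, and this is an analogy rather than what the package does internally.

```r
# Toy stand-in for the real data, including one record without a name
toy <- data.frame(
  scientificName = c("Vulpes Frisch, 1775",
                     "Vombatus ursinus (Shaw, 1800)",
                     "Vulpes Frisch, 1775",
                     NA),
  year = c(2001L, 2005L, 2010L, 1999L),
  stringsAsFactors = FALSE
)

# One dataframe per unique value; rows with a missing name are dropped by split()
byName <- split(toy, toy$scientificName)
names(byName)
```
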

  1. Instead of a concrete value-based partition (like column=columnValue), you can also partition on a statistic computed across the rows. For example, if you don’t want any columns with more than 70% of their values missing,
filledData <- addPartition(grouping = "greaterThan70")
print(filledData)
##           levelName
## 1 Partition Tree   
## 2  ¦--GreaterThan70
## 3  °--LessThan70

The grouping name is flexible: you can use any value in place of 70. And when you ask for greaterThan70, the complementary LessThan70 partition is added automatically.
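
One way to picture this kind of partition is the base R sketch below. It assumes that the threshold refers to the percentage of missing values per column, which the text suggests but the output does not confirm; parseThreshold and the toy data are hypothetical.

```r
# Toy stand-in for the real data
toy <- data.frame(
  scientificName = c("a", "b", "c", "d"),          # 0% missing
  eventTime      = c(NA, NA, NA, "10:00"),         # 75% missing
  year           = c(2001L, NA, 2005L, 2010L),     # 25% missing
  stringsAsFactors = FALSE
)

# Hypothetical helper: pull the number out of a name such as "greaterThan70"
parseThreshold <- function(grouping) {
  as.numeric(sub("^greaterThan", "", grouping))
}

threshold <- parseThreshold("greaterThan70")

# Percentage of missing values in each column
missingPct <- 100 * colMeans(is.na(toy))

# Assumption: "GreaterThan" collects the columns above the missing-value threshold
parts <- list(
  toy[, missingPct > threshold, drop = FALSE],
  toy[, missingPct <= threshold, drop = FALSE]
)
names(parts) <- paste0(c("GreaterThan", "LessThan"), threshold)
lapply(parts, names)
```
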

  1. You can add nested partitions to the partition tree too. Say you need the records returned as spatial, temporal and taxon groups, but within each group you don’t want columns with more than 50% of their values missing,
primaryFilledData <- addPartition(grouping = "primary3")
addPartition(primaryFilledData, grouping = "greaterThan50")
##                levelName
## 1  Partition Tree       
## 2   ¦--taxonClass       
## 3   ¦   ¦--GreaterThan50
## 4   ¦   °--LessThan50   
## 5   ¦--locationClass    
## 6   ¦   ¦--GreaterThan50
## 7   ¦   °--LessThan50   
## 8   °--eventClass       
## 9       ¦--GreaterThan50
## 10      °--LessThan50

Here, we will get 6 dataframes as outputs.

  1. When you build a partition tree, you can list all the dataframes that will be returned.
listPartitions(primaryFilledData)
##  [1] "Partition Tree"                                  
##  [2] "Partition Tree -> taxonClass"                    
##  [3] "Partition Tree -> locationClass"                 
##  [4] "Partition Tree -> eventClass"                    
##  [5] "Partition Tree -> taxonClass -> GreaterThan50"   
##  [6] "Partition Tree -> taxonClass -> LessThan50"      
##  [7] "Partition Tree -> locationClass -> GreaterThan50"
##  [8] "Partition Tree -> locationClass -> LessThan50"   
##  [9] "Partition Tree -> eventClass -> GreaterThan50"   
## [10] "Partition Tree -> eventClass -> LessThan50"

If you want, you can delete any of the entries here and pass the new vector as the grouping list to the applyPartitions() function.

oldList <- listPartitions(primaryFilledData)
newList <- oldList[-c(4:7)]
newList
## [1] "Partition Tree"                               
## [2] "Partition Tree -> taxonClass"                 
## [3] "Partition Tree -> locationClass"              
## [4] "Partition Tree -> locationClass -> LessThan50"
## [5] "Partition Tree -> eventClass -> GreaterThan50"
## [6] "Partition Tree -> eventClass -> LessThan50"
newPartition <- applyPartitions(groupList = newList, data = australianMammals)
  1. If you check the list of outputs to be returned, it has 10 entries instead of 6. This is because applying a partition tree returns a dataframe for every node of the tree, not just for the leaves.

This gives you the entire history of the partitioning as usable datasets, which allows a lot of flexibility. If you don’t want that, use the onlyLeaf parameter in both listPartitions() and applyPartitions().

listPartitions(primaryFilledData, onlyLeaf = TRUE)
## [1] "Partition Tree -> taxonClass -> GreaterThan50"   
## [2] "Partition Tree -> taxonClass -> LessThan50"      
## [3] "Partition Tree -> locationClass -> GreaterThan50"
## [4] "Partition Tree -> locationClass -> LessThan50"   
## [5] "Partition Tree -> eventClass -> GreaterThan50"   
## [6] "Partition Tree -> eventClass -> LessThan50"
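
The relationship between the full listing and the leaf-only listing can be reproduced on plain path strings. The sketch below is an illustration in base R, not the package's code; it treats a path as a leaf when no other path extends it.

```r
# Paths in the same "A -> B -> C" form as the listing above (shortened)
paths <- c(
  "Partition Tree",
  "Partition Tree -> taxonClass",
  "Partition Tree -> taxonClass -> GreaterThan50",
  "Partition Tree -> taxonClass -> LessThan50"
)

# A path is a leaf if no other path starts with it followed by " -> "
isLeaf <- vapply(paths, function(p) {
  !any(startsWith(paths, paste0(p, " -> ")))
}, logical(1))

leaves <- paths[unname(isLeaf)]
leaves
```
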
  1. Whether two or more partitions sit at the same level or are nested is controlled by the applyTo parameter. For example,
groupsLeaf <- addPartition(grouping = "primary3")
addPartition(groupsLeaf, grouping = "greaterThan50", applyTo = "leaf")
##                levelName
## 1  Partition Tree       
## 2   ¦--taxonClass       
## 3   ¦   ¦--GreaterThan50
## 4   ¦   °--LessThan50   
## 5   ¦--locationClass    
## 6   ¦   ¦--GreaterThan50
## 7   ¦   °--LessThan50   
## 8   °--eventClass       
## 9       ¦--GreaterThan50
## 10      °--LessThan50

This is the default option. But the following is different.

groupsRoot <- addPartition(grouping = "primary3")
addPartition(groupsRoot, grouping = "greaterThan50", applyTo = "root")
##           levelName
## 1 Partition Tree   
## 2  ¦--taxonClass   
## 3  ¦--locationClass
## 4  ¦--eventClass   
## 5  ¦--GreaterThan50
## 6  °--LessThan50
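
The difference between the two modes can be modelled with nested lists in base R. This is a simplified sketch under assumptions: the tree is represented as a named list in which an empty list is a leaf, and addToRoot and addToLeaves are hypothetical helpers, not functions of the package.

```r
# A tree as a nested named list; an empty list marks a leaf
tree <- list(taxonClass = list(), locationClass = list(), eventClass = list())

# The two partitions to be added
newNodes <- list(GreaterThan50 = list(), LessThan50 = list())

# "root" mode: the new partitions become siblings of the existing ones
addToRoot <- function(tree, nodes) {
  c(tree, nodes)
}

# "leaf" mode: the new partitions are attached beneath every existing leaf
addToLeaves <- function(tree, nodes) {
  if (length(tree) == 0) {
    return(nodes)
  }
  lapply(tree, addToLeaves, nodes = nodes)
}

rootTree <- addToRoot(tree, newNodes)
leafTree <- addToLeaves(tree, newNodes)

names(rootTree)
names(leafTree$eventClass)
```

In root mode the tree gains two top-level nodes, while in leaf mode every class gets its own pair of children, which matches the two printed trees above.
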
  1. You can plot the partition trees to visualize the groupings. Simply call the plot function.
plot(groupsLeaf)
plot(groupsRoot)
plot(groups)
  1. The package lets you add quality checks as groupings too. Here, the generated flags are used to partition the data.

More on that –> Discuss with Ashwin

Advanced

  1. Creating your own groupings –> complete this and add more groups in addition to (‘all’ and ‘primary3’)