The GBIF data is huge and complex. The long version has around 235 fields and the short version has around 110. But as you may have experienced, most of these fields are rarely important; only around 5 to 10 fields matter most of the time. Further, the fields are jumbled up in no sensible order, with no categorical divisions among them.
The following set of functions tries to address that issue.
Let’s start by working through some use cases we might come across. I have a data set called australianMammals (Australian mammals data with 235 columns and 1758193 rows). So let’s first try to categorize these columns into the common groups classified by GBIF.
# Note: What are these groups classified by GBIF?
# 1) taxonClass - Taxonomic names, taxon name usages, or taxon concepts.
# 2) locationClass - A spatial region or named place. Terms describing a place, whether named or not.
# 3) eventClass - Information about an event (an action that occurs at a place and during a period of time)
# 4) occurenceClass - Evidence of an occurrence in nature, or in a collection (specimen, observation, etc.)
# 5) recordlevelTermsClass - Terms that apply to the whole record regardless of the record type.
# 6) geologicalContextClass - Information about a location within a geological context, such as stratigraphy.
# 7) identificationClass - Taxonomic determinations (the assignment of a scientific name)
# 8) resourceRelationshipClass
# 9) measurementOrFactClass
# In other words, we can say,
# taxonClass is taxonomical data
# locationClass is spatial data
# eventClass is temporal data
# For the full list of fields in each group, refer to the appendix
allGroups <- addPartition(grouping = "all")
results <- applyPartitions(allGroups, australianMammals)
That’s it. Simple as that. Let’s see how the results are created.
In the first line we create a partition tree. This tree decides how the final partition is done. Let’s see what the partition tree looks like.
print(allGroups)
## levelName
## 1 Partition Tree
## 2 ¦--taxonClass
## 3 ¦--locationClass
## 4 ¦--eventClass
## 5 ¦--occurenceClass
## 6 ¦--recordlevelTermsClass
## 7 ¦--geologicalContextClass
## 8 ¦--identificationClass
## 9 ¦--resourceRelationshipClass
## 10 °--measurementOrFactClass
So, the original data is divided into 10 partitions: the root, which holds all the fields, plus the nine classes. In the second line, we apply that partition tree to the data.
summary(results)
## Length Class Mode
## Partition Tree 235 data.frame list
## Partition Tree/taxonClass 29 data.frame list
## Partition Tree/locationClass 35 data.frame list
## Partition Tree/eventClass 15 data.frame list
## Partition Tree/occurenceClass 20 data.frame list
## Partition Tree/recordlevelTermsClass 11 data.frame list
## Partition Tree/geologicalContextClass 18 data.frame list
## Partition Tree/identificationClass 8 data.frame list
## Partition Tree/resourceRelationshipClass 0 data.frame list
## Partition Tree/measurementOrFactClass 0 data.frame list
Here you can see that the result is a list of 10 dataframes, each with a different number of fields. Let’s see which fields are in eventClass.
names(results$'Partition Tree/eventClass')
## [1] "eventID" "samplingProtocol" "samplingEffort"
## [4] "eventDate" "eventTime" "startDayOfYear"
## [7] "endDayOfYear" "year" "month"
## [10] "day" "verbatimEventDate" "habitat"
## [13] "fieldNumber" "fieldNotes" "eventRemarks"
Here, all the fields related to temporal aspects are grouped together, which is what we need.
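The column grouping above can be pictured with a minimal base R sketch. It is not the package's implementation: the `fieldGroups` list, the `partitionColumns()` helper and the toy data frame are hypothetical names made up for illustration, and only a few fields per class are listed.

```r
# Hypothetical, heavily shortened field lists; the real classes hold many more fields.
fieldGroups <- list(
  taxonClass    = c("scientificName", "kingdom", "family"),
  locationClass = c("country", "decimalLatitude", "decimalLongitude"),
  eventClass    = c("eventDate", "year", "month")
)

# Toy stand-in for an occurrence data set (not real GBIF records).
toy <- data.frame(
  scientificName   = c("Vulpes Frisch, 1775", "Macropus giganteus Shaw, 1790"),
  year             = c(2010L, 2015L),
  decimalLatitude  = c(-33.8, -37.8),
  decimalLongitude = c(151.2, 144.9),
  kingdom          = c("Animalia", "Animalia"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one data frame per group, keeping only the
# fields of that group that are actually present in the data.
partitionColumns <- function(data, groups) {
  lapply(groups, function(fields) {
    data[, intersect(fields, names(data)), drop = FALSE]
  })
}

parts <- partitionColumns(toy, fieldGroups)
names(parts$locationClass)
```

Each element of `parts` is a data frame with the same rows as `toy` but only the columns of one class, which is the same shape of result as the list of dataframes shown above.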
To get a single dataframe back instead of a list of dataframes, use the return parameter.
results <- applyPartitions(allGroups, australianMammals, return = "dataframe")
dim(results)
## [1] 6 136
head(names(results), 30)
## [1] "taxonID" "scientificNameID" "acceptedNameUsageID"
## [4] "parentNameUsageID" "originalNameUsageID" "nameAccordingToID"
## [7] "namePublishedInID" "taxonConceptID" "scientificName"
## [10] "acceptedNameUsage" "parentNameUsage" "originalNameUsage"
## [13] "nameAccordingTo" "namePublishedIn" "namePublishedInYear"
## [16] "higherClassification" "kingdom" "phylum"
## [19] "class" "order" "family"
## [22] "genus" "subgenus" "specificEpithet"
## [25] "infraspecificEpithet" "taxonRank" "verbatimTaxonRank"
## [28] "vernacularName" "nomenclaturalCode" "locationID"
If you look closely, all related fields are grouped together: the first 29 belong to taxonClass, followed by those of locationClass. The important thing is that you get a single dataframe whose columns are ordered by group.
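The effect of getting one dataframe with related columns side by side can be sketched in base R as a column reordering. This is only an illustration under assumptions: `fieldGroups` and the toy data are hypothetical, and the package may build its result differently.

```r
# Hypothetical, shortened field lists in the order the groups should appear.
fieldGroups <- list(
  taxonClass    = c("scientificName", "kingdom"),
  locationClass = c("decimalLatitude", "decimalLongitude"),
  eventClass    = c("year")
)

# Toy data with deliberately jumbled columns, plus one column ("issue")
# that belongs to none of the groups.
toy <- data.frame(
  year = 2010L, decimalLongitude = 151.2, kingdom = "Animalia",
  scientificName = "Vulpes Frisch, 1775", decimalLatitude = -33.8, issue = NA,
  stringsAsFactors = FALSE
)

# Flatten the groups into one ordered vector, keep only the columns that
# exist in the data, and select them in that order.
ordered <- intersect(unlist(fieldGroups, use.names = FALSE), names(toy))
grouped <- toy[, ordered, drop = FALSE]
names(grouped)
```

In this sketch, columns outside every group are dropped, which would be one way to end up with fewer columns than the input, as in the 235 to 136 reduction above.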
primaryGroups <- addPartition(grouping = "primary3")
results <- applyPartitions(primaryGroups, australianMammals)
summary(results)
## Length Class Mode
## Partition Tree 235 data.frame list
## Partition Tree/taxonClass 29 data.frame list
## Partition Tree/locationClass 35 data.frame list
## Partition Tree/eventClass 15 data.frame list
So this returns only the fields related to these classes.
taxonGroup <- addPartition(grouping = "taxonClass")
print(taxonGroup)
## levelName
## 1 Partition Tree
## 2 °--taxonClass
results <- applyPartitions(taxonGroup, australianMammals, return = "dataframe")
dim(results)
## [1] 6 29
The above section was entirely about partitions based on columns. You can also add partitions based on rows, or even on quality checks.
groups <- addPartition(grouping = "column", column = "scientificName~Vulpes Frisch, 1775", data = australianMammals)
Note that the parameters have changed a lot here. When you make a partition based on a column, the grouping parameter should be "column". The column parameter should contain a ~; here it’s "scientificName~Vulpes Frisch, 1775". The dataframe should be given too. Let’s see what the partition looks like.
groups
## levelName
## 1 Partition Tree
## 2 ¦--scientificName=Vulpes Frisch, 1775
## 3 °--scientificName!=Vulpes Frisch, 1775
If nothing is given after the ~ (like "scientificName~"), then the partition will be all sets of unique scientificNames.
groups <- addPartition(grouping = "column", column = "scientificName~", data = australianMammals)
groups
## levelName
## 1 Partition Tree
## 2 ¦--Oryctolagus cuniculus (Linnaeus, 1758)
## 3 ¦--Tachyglossus aculeatus (Shaw, 1792)
## 4 ¦--Vombatus ursinus (Shaw, 1800)
## 5 ¦--Vulpes Frisch, 1775
## 6 °--Macropus giganteus Shaw, 1790
Exciting huh?
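For intuition, both kinds of row partition can be reproduced on a toy data frame with base R alone. This is a sketch of the idea, not the package's code; the data and the variable names `twoWay` and `byName` are made up for illustration.

```r
# Toy data (not real GBIF records).
toy <- data.frame(
  scientificName = c("Vulpes Frisch, 1775", "Macropus giganteus Shaw, 1790",
                     "Vulpes Frisch, 1775", "Vombatus ursinus (Shaw, 1800)"),
  year = c(2010L, 2015L, 2012L, 2011L),
  stringsAsFactors = FALSE
)

# Two-way partition: rows that match the value versus rows that do not.
# (The toy column has no NA values; NA names would need separate handling.)
isVulpes <- toy$scientificName == "Vulpes Frisch, 1775"
twoWay <- list(
  "scientificName=Vulpes Frisch, 1775"  = toy[isVulpes, , drop = FALSE],
  "scientificName!=Vulpes Frisch, 1775" = toy[!isVulpes, , drop = FALSE]
)

# Partition per unique value: split() returns one data frame per name.
byName <- split(toy, toy$scientificName)

sapply(twoWay, nrow)
names(byName)
```

The two-way list corresponds to the `=` / `!=` tree, and `byName` to the tree with one node per unique scientificName.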
Apart from partitioning on a column value (column=columnValue), you can even run a cumulative function to partition based on rows. For example, if you don’t want any columns with more than 70% missing row values,
filledData <- addPartition(grouping = "greaterThan70")
print(filledData)
## levelName
## 1 Partition Tree
## 2 ¦--GreaterThan70
## 3 °--LessThan70
The threshold is not fixed: you can use any value in place of 70. And when you ask for greaterThan70, the complementary LessThan70 partition is added automatically.
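A threshold split of this kind can be sketched in base R. The sketch assumes that the threshold refers to the percentage of missing (NA) values per column, which is one reading of the description above; `partitionByMissing()` and the toy data are hypothetical and not part of the package.

```r
# Toy data with known shares of missing values per column.
toy <- data.frame(
  scientificName = c("a", "b", "c", "d"),       # 0% missing
  habitat        = c(NA, NA, NA, "forest"),     # 75% missing
  year           = c(2010L, NA, 2012L, 2011L),  # 25% missing
  fieldNotes     = c(NA, NA, NA, NA),           # 100% missing
  stringsAsFactors = FALSE
)

# Hypothetical helper; threshold is a percentage between 0 and 100.
# Assumption: columns are compared on their percentage of NA values.
partitionByMissing <- function(data, threshold) {
  missingPct <- colMeans(is.na(data)) * 100
  over <- missingPct > threshold
  # Columns exactly at the threshold fall into the second group.
  parts <- list(data[, over, drop = FALSE], data[, !over, drop = FALSE])
  names(parts) <- paste0(c("GreaterThan", "LessThan"), threshold)
  parts
}

parts <- partitionByMissing(toy, 70)
lapply(parts, names)
```

Because the threshold is an ordinary argument, the same helper covers 50, 70 or any other value, and the complementary group always comes back alongside the requested one.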
primaryFilledData <- addPartition(grouping = "primary3")
addPartition(primaryFilledData, grouping = "greaterThan50")
## levelName
## 1 Partition Tree
## 2 ¦--taxonClass
## 3 ¦ ¦--GreaterThan50
## 4 ¦ °--LessThan50
## 5 ¦--locationClass
## 6 ¦ ¦--GreaterThan50
## 7 ¦ °--LessThan50
## 8 °--eventClass
## 9 ¦--GreaterThan50
## 10 °--LessThan50
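Here, the tree has 6 leaves, so we will get 6 leaf dataframes as outputs. Listing the partitions shows these together with the intermediate nodes.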
listPartitions(primaryFilledData)
## [1] "Partition Tree"
## [2] "Partition Tree -> taxonClass"
## [3] "Partition Tree -> locationClass"
## [4] "Partition Tree -> eventClass"
## [5] "Partition Tree -> taxonClass -> GreaterThan50"
## [6] "Partition Tree -> taxonClass -> LessThan50"
## [7] "Partition Tree -> locationClass -> GreaterThan50"
## [8] "Partition Tree -> locationClass -> LessThan50"
## [9] "Partition Tree -> eventClass -> GreaterThan50"
## [10] "Partition Tree -> eventClass -> LessThan50"
If you want, you can delete any of the values here and pass the new vector as the grouping list to the applyPartitions() function.
oldList <- listPartitions(primaryFilledData)
newList <- oldList[-c(4:7)]
newList
## [1] "Partition Tree"
## [2] "Partition Tree -> taxonClass"
## [3] "Partition Tree -> locationClass"
## [4] "Partition Tree -> locationClass -> LessThan50"
## [5] "Partition Tree -> eventClass -> GreaterThan50"
## [6] "Partition Tree -> eventClass -> LessThan50"
newPartition <- applyPartitions(groupList = newList, data = australianMammals)
The intermediate nodes are listed because the user gets the entire partition history as actionable datasets, which gives a lot of flexibility. But if you don’t want that, use the onlyLeaf parameter in both listPartitions() and applyPartitions().
listPartitions(primaryFilledData, onlyLeaf = TRUE)
## [1] "Partition Tree -> taxonClass -> GreaterThan50"
## [2] "Partition Tree -> taxonClass -> LessThan50"
## [3] "Partition Tree -> locationClass -> GreaterThan50"
## [4] "Partition Tree -> locationClass -> LessThan50"
## [5] "Partition Tree -> eventClass -> GreaterThan50"
## [6] "Partition Tree -> eventClass -> LessThan50"
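The difference between the full list and the leaf-only list can be expressed directly on the path strings. The following base R sketch is an assumption about how such filtering could work, not how listPartitions() does it; the `paths` vector is a shortened, made-up example.

```r
# Shortened example of partition paths in the " -> " format shown above.
paths <- c(
  "Partition Tree",
  "Partition Tree -> taxonClass",
  "Partition Tree -> taxonClass -> GreaterThan50",
  "Partition Tree -> taxonClass -> LessThan50",
  "Partition Tree -> eventClass"
)

# A path is a leaf when no other path extends it with a further " -> " step.
isLeaf <- vapply(paths, function(p) {
  !any(startsWith(paths, paste0(p, " -> ")))
}, logical(1), USE.NAMES = FALSE)

paths[isLeaf]
```

Here eventClass counts as a leaf because nothing was partitioned below it, while taxonClass does not.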
Where a new partition is attached in the tree is controlled by the applyTo parameter. For example,
groupsLeaf <- addPartition(grouping = "primary3")
addPartition(groupsLeaf, grouping = "greaterThan50", applyTo = "leaf")
## levelName
## 1 Partition Tree
## 2 ¦--taxonClass
## 3 ¦ ¦--GreaterThan50
## 4 ¦ °--LessThan50
## 5 ¦--locationClass
## 6 ¦ ¦--GreaterThan50
## 7 ¦ °--LessThan50
## 8 °--eventClass
## 9 ¦--GreaterThan50
## 10 °--LessThan50
This is the default option. But the following is different.
groupsRoot <- addPartition(grouping = "primary3")
addPartition(groupsRoot, grouping = "greaterThan50", applyTo = "root")
## levelName
## 1 Partition Tree
## 2 ¦--taxonClass
## 3 ¦--locationClass
## 4 ¦--eventClass
## 5 ¦--GreaterThan50
## 6 °--LessThan50
plot(groupsLeaf)
plot(groupsRoot)
plot(groups)