I am interested in finding occurrence data for the threatened California cactus wren, native to coastal southern California.
This open data group has developed a solid collection of R packages to make scientific data more accessible. They include packages to retrieve climate data (GSODR, NOAA, World Bank Climate, Hydrological, Air quality), search museum records (paleobioDB), and get species occurrence data from major databases.
Species occurrences:
GBIF, BISON, spocc, Fishbase, eBird, VertNet, iNaturalist
The spocc package searches GBIF, Berkeley Ecoengine, iNaturalist, VertNet, BISON, eBird, AntWeb, and iDigBio.
install.packages('spocc')
library(spocc) ##search multiple databases
library(dplyr)
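##occ_count() comes from rgbif (not spocc) and needs a GBIF taxon key, which is not filled in here
#rgbif::occ_count(taxonKey = , georeferenced = TRUE) #return number of occurrence records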
##search for data
cawr<-'Campylorhynchus brunneicapillus' ##California cactus wren
cawr_obs<-occ(query = cawr, from=c('gbif','bison'), limit = 1000, has_coords = TRUE)
##limit is per source
The results from each data source are accessed with ‘$’, and each source contains ‘meta’ and ‘data’ elements:
str(cawr_obs$gbif$meta) ##found over 113,000 georeferenced records
## List of 6
## $ source : chr "gbif"
## $ time : POSIXct[1:1], format: "2017-03-12 20:05:41"
## $ found : int 113829
## $ returned: int 1000
## $ type : chr "sci"
## $ opts :List of 4
## ..$ hasCoordinate : logi TRUE
## ..$ scientificName: chr "Campylorhynchus brunneicapillus"
## ..$ limit : num 1000
## ..$ fields : chr "all"
str(cawr_obs$bison$meta) ##71819
## List of 6
## $ source : chr "bison"
## $ time : POSIXct[1:1], format: "2017-03-12 20:05:46"
## $ found : int 71819
## $ returned: int 1000
## $ type : chr "sci"
## $ opts :List of 4
## ..$ scientificName: chr "Campylorhynchus brunneicapillus"
## ..$ verbose : logi FALSE
## ..$ rows : num 1000
## ..$ callopts : list()
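Because every source keeps its own ‘meta’ list, the found and returned counts can be collected into one small table. The sketch below uses a hand-built list, `res`, that only mimics the nested shape printed above (its values are copied from that output, it is not a live query), so it runs without a network connection.

```r
# Hand-built stand-in for the nested structure returned by occ();
# the counts are copied from the printed metadata above.
res <- list(
  gbif  = list(meta = list(source = "gbif",  found = 113829L, returned = 1000L)),
  bison = list(meta = list(source = "bison", found = 71819L,  returned = 1000L))
)

# Pull the same fields out of each source's meta list.
counts <- data.frame(
  source   = names(res),
  found    = vapply(res, function(x) x$meta$found, integer(1)),
  returned = vapply(res, function(x) x$meta$returned, integer(1)),
  row.names = NULL
)

# Share of the available records actually downloaded under the per-source limit.
counts$fraction <- counts$returned / counts$found
print(counts)
```

With a real result, `cawr_obs` should be usable in place of `res`. A live `occ()` object may also hold entries for sources that were not queried, so it is safer to subset to the queried sources first.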
head(cawr_obs$gbif$data$Campylorhynchus_brunneicapillus)[1:6, 1:10]
## # A tibble: 6 × 10
## name longitude latitude issues prov
## <chr> <dbl> <dbl> <chr> <chr>
## 1 Campylorhynchus brunneicapillus -102.8253 20.70048 gass84 gbif
## 2 Campylorhynchus brunneicapillus -102.9788 21.12558 gass84 gbif
## 3 Campylorhynchus brunneicapillus -111.5015 33.67817 cdround,gass84 gbif
## 4 Campylorhynchus brunneicapillus -102.9784 21.12494 gass84 gbif
## 5 Campylorhynchus brunneicapillus -111.1686 32.24244 cdround,gass84 gbif
## 6 Campylorhynchus brunneicapillus -109.8134 23.42220 gass84 gbif
## # ... with 5 more variables: key <int>, datasetKey <chr>,
## # publishingOrgKey <chr>, publishingCountry <chr>, protocol <chr>
head(cawr_obs$bison$data$Campylorhynchus_brunneicapillus)[1:6, 1:10]
## # A tibble: 6 × 10
## computedCountyFips providerID catalogNumber basisOfRecord countryCode
## <chr> <int> <chr> <chr> <chr>
## 1 06073 602 OBS189249644 observation US
## 2 35013 602 OBS177525672 observation US
## 3 06065 602 OBS175883438 observation US
## 4 04019 602 OBS152239982 observation US
## 5 04019 602 OBS144378546 observation US
## 6 35013 602 OBS165216866 observation US
## # ... with 5 more variables: ITISscientificName <chr>, latlon <chr>,
## # calculatedState <chr>, longitude <dbl>, year <int>
You can see that the different providers return differently organized data frames. Some contain more information than others, or unique keys that help identify duplicate observations across sources.
Several of the ‘spocc’ data providers interface with each other. This means some observations may be duplicated if documented in multiple locations. This should be addressed when gathering from multiple sources. In particular, GBIF gathers data from many of the other providers like iNaturalist, which is documented in the dataframe returned by GBIF.
df<-cawr_obs$gbif$data$Campylorhynchus_brunneicapillus
unique(df$institutionCode) ##list sources of GBIF data
## [1] "iNaturalist" "CLO" "MLZ"
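One simple way to handle such duplicates is to treat records that share a species name, rounded coordinates, and date as the same observation. This is only a sketch under that assumption: the toy data frame `obs_toy`, the four-decimal rounding, and the matching rule are illustrative choices, not something spocc or GBIF applies for you.

```r
# Toy combined table with the basic columns (name, lon/lat, provider, date);
# all values here are made up for illustration.
obs_toy <- data.frame(
  name      = rep("Campylorhynchus brunneicapillus", 4),
  longitude = c(-117.10001, -117.10002, -116.50000, -117.10001),
  latitude  = c(32.80001, 32.80002, 33.20000, 32.80001),
  prov      = c("gbif", "bison", "gbif", "gbif"),
  date      = as.Date(c("2016-01-25", "2016-01-25", "2016-01-31", "2016-02-02")),
  stringsAsFactors = FALSE
)

# Build a matching key from name, coordinates rounded to 4 decimals, and date.
# The provider is deliberately left out so cross-source copies collapse.
key <- paste(obs_toy$name,
             round(obs_toy$longitude, 4),
             round(obs_toy$latitude, 4),
             obs_toy$date)

# Keep the first record for each key.
obs_unique <- obs_toy[!duplicated(key), ]
print(obs_unique)
```

Rounding is a judgment call: too coarse and distinct nearby sightings merge, too fine and the same record with slightly different coordinates survives twice.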
The ‘occ2df’ call combines the results from multiple sources into a single data.frame with basic lat/lon, provider, date, and key.
obs<-occ2df(cawr_obs)
head(obs)
## # A tibble: 6 × 6
## name longitude latitude prov date
## <chr> <dbl> <dbl> <chr> <date>
## 1 Campylorhynchus brunneicapillus -102.8253 20.70048 gbif 2016-01-25
## 2 Campylorhynchus brunneicapillus -102.9788 21.12558 gbif 2016-01-31
## 3 Campylorhynchus brunneicapillus -111.5015 33.67817 gbif 2016-01-01
## 4 Campylorhynchus brunneicapillus -102.9784 21.12494 gbif 2016-01-31
## 5 Campylorhynchus brunneicapillus -111.1686 32.24244 gbif 2016-01-07
## 6 Campylorhynchus brunneicapillus -109.8134 23.42220 gbif 2016-01-13
## # ... with 1 more variables: key <chr>
Global Biodiversity Information Facility
716,804,704 occurrences
1,643,948 species
31,883 datasets
install.packages('rgbif')
library(rgbif)
Search for records by species name:
out<-occ_search(scientificName=cawr, limit=1000, hasCoordinate = TRUE, year = NULL)
out$meta ##look at metadata
## $offset
## [1] 900
##
## $limit
## [1] 100
##
## $endOfRecords
## [1] FALSE
##
## $count
## [1] 113829
head(out$data)
## # A tibble: 6 × 101
## name key decimalLatitude
## <chr> <int> <dbl>
## 1 Campylorhynchus brunneicapillus 1249278001 20.70048
## 2 Campylorhynchus brunneicapillus 1249286772 21.12558
## 3 Campylorhynchus brunneicapillus 1323023876 33.67817
## 4 Campylorhynchus brunneicapillus 1249286782 21.12494
## 5 Campylorhynchus brunneicapillus 1229610219 32.24244
## 6 Campylorhynchus brunneicapillus 1269556454 23.42220
## # ... with 98 more variables: decimalLongitude <dbl>, issues <chr>,
## # datasetKey <chr>, publishingOrgKey <chr>, publishingCountry <chr>,
## # protocol <chr>, lastCrawled <chr>, lastParsed <chr>, crawlId <int>,
## # extensions <chr>, basisOfRecord <chr>, taxonKey <int>,
## # kingdomKey <int>, phylumKey <int>, classKey <int>, orderKey <int>,
## # familyKey <int>, genusKey <int>, speciesKey <int>,
## # scientificName <chr>, kingdom <chr>, phylum <chr>, order <chr>,
## # family <chr>, genus <chr>, species <chr>, genericName <chr>,
## # specificEpithet <chr>, taxonRank <chr>, dateIdentified <chr>,
## # coordinateUncertaintyInMeters <dbl>, year <int>, month <int>,
## # day <int>, eventDate <chr>, modified <chr>, lastInterpreted <chr>,
## # references <chr>, license <chr>, identifiers <chr>, facts <chr>,
## # relations <chr>, geodeticDatum <chr>, class <chr>, countryCode <chr>,
## # country <chr>, rightsHolder <chr>, identifier <chr>,
## # verbatimEventDate <chr>, datasetName <chr>, collectionCode <chr>,
## # gbifID <chr>, verbatimLocality <chr>, occurrenceID <chr>,
## # taxonID <chr>, catalogNumber <chr>, recordedBy <chr>,
## # http...unknown.org.occurrenceDetails <chr>, institutionCode <chr>,
## # rights <chr>, eventTime <chr>, identificationID <chr>,
## # occurrenceRemarks <chr>, informationWithheld <chr>,
## # infraspecificEpithet <chr>, individualCount <int>,
## # stateProvince <chr>, locality <chr>, county <chr>, sex <chr>,
## # elevation <dbl>, elevationAccuracy <dbl>, continent <chr>,
## # habitat <chr>, institutionID <chr>, dynamicProperties <chr>,
## # identificationVerificationStatus <chr>, language <chr>, type <chr>,
## # locationAccordingTo <chr>, preparations <chr>, identifiedBy <chr>,
## # georeferencedDate <chr>, higherGeography <chr>,
## # nomenclaturalCode <chr>, georeferencedBy <chr>,
## # georeferenceProtocol <chr>, georeferenceVerificationStatus <chr>,
## # endDayOfYear <chr>, verbatimCoordinateSystem <chr>,
## # otherCatalogNumbers <chr>, organismID <chr>,
## # previousIdentifications <chr>, identificationQualifier <chr>,
## # samplingProtocol <chr>, accessRights <chr>,
## # higherClassification <chr>, georeferenceSources <chr>
Search for records for a given time period & location:
##observations from 1997 to 2017 (the last 20 years)
out<-occ_search(scientificName=cawr, limit=1000, hasCoordinate = TRUE, year = '1997,2017')
df<-out$data
library(ggmap)
library(ggplot2)
#socal<-make_bbox(lon = , lat=)
#out<-occ_search(scientificName=cawr, limit=1000, hasCoordinate = TRUE, year = '1997,2017', geometry = )
library(maps)
counties <- subset(map_data("county"), region == "california") ##polygon outlines for CA counties
ggplot()+
geom_polygon(data=counties, aes(x=long, y= lat, group=group), color='lightgrey', fill=NA)+
geom_point(data=df, aes(x=decimalLongitude, y=decimalLatitude), shape=1, size=2, alpha=.85, color='darkblue')+
coord_fixed(xlim = c(-123, -113.0), ylim = c(32, 37), ratio = 1.3)+
theme_nothing()
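Note that `coord_fixed()` only crops the view; records outside the window are still in `df`. A minimal sketch of dropping them instead, using the same longitude and latitude limits as the plot. The helper `in_bbox()` and the toy data frame are hypothetical names introduced here; the column names match the rgbif output shown earlier, the first two points come from that output, and the last two are made up.

```r
# Keep only rows whose coordinates fall inside a lon/lat window.
# Rows with missing coordinates are dropped as well.
in_bbox <- function(d, xlim, ylim) {
  keep <- !is.na(d$decimalLongitude) & !is.na(d$decimalLatitude) &
    d$decimalLongitude >= xlim[1] & d$decimalLongitude <= xlim[2] &
    d$decimalLatitude  >= ylim[1] & d$decimalLatitude  <= ylim[2]
  d[keep, , drop = FALSE]
}

toy <- data.frame(
  decimalLongitude = c(-102.8253, -111.5015, -117.2, -116.9),
  decimalLatitude  = c(20.70048, 33.67817, 32.9, 33.5)
)

# Same window as the map: longitude -123 to -113, latitude 32 to 37.
socal_toy <- in_bbox(toy, xlim = c(-123, -113), ylim = c(32, 37))
print(socal_toy)
```

Applied to the real data this would be `in_bbox(df, c(-123, -113), c(32, 37))`, which filters after download rather than restricting the query itself.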
For more on the ‘spocc’ package: http://ropensci.org/tutorials/spocc_tutorial.html
Additionally, see the ‘scrubr’ package for useful functions to clean biological occurrence records quickly (remove impossible lat/lon, clean duplicates, habitat filtering, geographic errors).
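The idea behind removing impossible coordinates can be sketched in base R for readers who want to see the logic. This is not scrubr's implementation; the function name `drop_impossible_coords()` and the toy data are assumptions made for illustration.

```r
# Drop rows whose coordinates are missing or outside the valid ranges
# (longitude within [-180, 180], latitude within [-90, 90]).
drop_impossible_coords <- function(d, lon = "longitude", lat = "latitude") {
  ok <- !is.na(d[[lon]]) & !is.na(d[[lat]]) &
    abs(d[[lon]]) <= 180 & abs(d[[lat]]) <= 90
  d[ok, , drop = FALSE]
}

# Made-up records: one valid, one bad longitude, one missing, one bad latitude.
toy_obs <- data.frame(
  longitude = c(-117.1, 250, NA, -116.5),
  latitude  = c(32.8, 33.0, 33.1, -95)
)

clean <- drop_impossible_coords(toy_obs)
print(clean)
```

A range check like this catches only values that cannot exist; points that are valid but implausible for the species (for example, in the ocean) need a separate geographic filter.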