California cactus wren

I am interested in finding occurrence data for the threatened California cactus wren, native to coastal southern California.

Resources

rOpenSci

http://ropensci.org/packages/

This open data group has developed a solid collection of R packages that make scientific data more accessible. They include packages to retrieve climate data (GSODR, NOAA, World Bank Climate, Hydrological, Air quality), search museum records (paleobioDB), and get species occurrence data from major databases.

Species occurrences:
GBIF, BISON, spocc, Fishbase, eBird, VertNet, iNaturalist

SPOCC

Searches GBIF, Berkeley Ecoengine, iNaturalist, VertNet, BISON, eBird, AntWeb, and iDigBio.

install.packages('spocc')
library(spocc) ##search multiple databases
library(dplyr)

#occ_count(taxonKey = , georeferenced = TRUE) ##rgbif function (not spocc): returns number of occurrence records; needs library(rgbif) and a GBIF taxon key

##search for data
cawr<-'Campylorhynchus brunneicapillus' ##cactus wren; a species-level name, so results are not limited to coastal California
cawr_obs<-occ(query = cawr, from=c('gbif','bison'), limit = 1000, has_coords = TRUE)
##limit is per source

The results from each data source are accessed with ‘$’; each source contains ‘meta’ and ‘data’:

str(cawr_obs$gbif$meta) ##found over 113,000 georeferenced records
## List of 6
##  $ source  : chr "gbif"
##  $ time    : POSIXct[1:1], format: "2017-03-12 20:05:41"
##  $ found   : int 113829
##  $ returned: int 1000
##  $ type    : chr "sci"
##  $ opts    :List of 4
##   ..$ hasCoordinate : logi TRUE
##   ..$ scientificName: chr "Campylorhynchus brunneicapillus"
##   ..$ limit         : num 1000
##   ..$ fields        : chr "all"
str(cawr_obs$bison$meta) ##71819
## List of 6
##  $ source  : chr "bison"
##  $ time    : POSIXct[1:1], format: "2017-03-12 20:05:46"
##  $ found   : int 71819
##  $ returned: int 1000
##  $ type    : chr "sci"
##  $ opts    :List of 4
##   ..$ scientificName: chr "Campylorhynchus brunneicapillus"
##   ..$ verbose       : logi FALSE
##   ..$ rows          : num 1000
##   ..$ callopts      : list()
head(cawr_obs$gbif$data$Campylorhynchus_brunneicapillus)[1:6, 1:10]
## # A tibble: 6 × 10
##                              name longitude latitude         issues  prov
##                             <chr>     <dbl>    <dbl>          <chr> <chr>
## 1 Campylorhynchus brunneicapillus -102.8253 20.70048         gass84  gbif
## 2 Campylorhynchus brunneicapillus -102.9788 21.12558         gass84  gbif
## 3 Campylorhynchus brunneicapillus -111.5015 33.67817 cdround,gass84  gbif
## 4 Campylorhynchus brunneicapillus -102.9784 21.12494         gass84  gbif
## 5 Campylorhynchus brunneicapillus -111.1686 32.24244 cdround,gass84  gbif
## 6 Campylorhynchus brunneicapillus -109.8134 23.42220         gass84  gbif
## # ... with 5 more variables: key <int>, datasetKey <chr>,
## #   publishingOrgKey <chr>, publishingCountry <chr>, protocol <chr>
head(cawr_obs$bison$data$Campylorhynchus_brunneicapillus)[1:6, 1:10]
## # A tibble: 6 × 10
##   computedCountyFips providerID catalogNumber basisOfRecord countryCode
##                <chr>      <int>         <chr>         <chr>       <chr>
## 1              06073        602  OBS189249644   observation          US
## 2              35013        602  OBS177525672   observation          US
## 3              06065        602  OBS175883438   observation          US
## 4              04019        602  OBS152239982   observation          US
## 5              04019        602  OBS144378546   observation          US
## 6              35013        602  OBS165216866   observation          US
## # ... with 5 more variables: ITISscientificName <chr>, latlon <chr>,
## #   calculatedState <chr>, longitude <dbl>, year <int>

You can see that the different providers return differently organized data frames. Some contain more information than others, or include unique keys that can help identify duplicate observations across sources.
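To make the column mismatch concrete, here is a minimal base R sketch of mapping two provider tables onto one shared schema. The toy data frames, their column names, and the `standardize` helper are all hypothetical and only illustrate the idea; this is not how ‘spocc’ does it internally.

```r
# Toy stand-ins for two provider tables; columns and values are made up.
gbif_toy <- data.frame(
  name = "Campylorhynchus brunneicapillus",
  longitude = c(-117.10, -116.95),
  latitude = c(32.70, 33.10),
  key = c("g1", "g2"),
  stringsAsFactors = FALSE
)
bison_toy <- data.frame(
  ITISscientificName = "Campylorhynchus brunneicapillus",
  decimalLongitude = -117.10,
  decimalLatitude = 32.70,
  occurrenceID = "b1",
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick the columns named in `cols`, rename them to the
# shared names given by names(cols), and tag each row with its provider.
standardize <- function(df, cols, prov) {
  out <- df[, cols, drop = FALSE]
  names(out) <- names(cols)
  out$prov <- prov
  out
}

combined <- rbind(
  standardize(gbif_toy,
              c(name = "name", longitude = "longitude",
                latitude = "latitude", key = "key"),
              "gbif"),
  standardize(bison_toy,
              c(name = "ITISscientificName", longitude = "decimalLongitude",
                latitude = "decimalLatitude", key = "occurrenceID"),
              "bison")
)
combined
```

Once every provider shares the same columns, rows can be stacked and compared directly.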

Duplicate observations

Several of the ‘spocc’ data providers interface with each other. This means some observations may be duplicated if they are documented in more than one database, which should be addressed when gathering data from multiple sources. In particular, GBIF aggregates data from many of the other providers, such as iNaturalist; the original source is recorded in the data frame returned by GBIF.

df<-cawr_obs$gbif$data$Campylorhynchus_brunneicapillus
unique(df$institutionCode) ##list sources of GBIF data
## [1] "iNaturalist" "CLO"         "MLZ"

The ‘occ2df’ call combines the results from multiple sources into a single data.frame with basic lat/lon, provider, date, and key.

obs<-occ2df(cawr_obs)
head(obs)
## # A tibble: 6 × 6
##                              name longitude latitude  prov       date
##                             <chr>     <dbl>    <dbl> <chr>     <date>
## 1 Campylorhynchus brunneicapillus -102.8253 20.70048  gbif 2016-01-25
## 2 Campylorhynchus brunneicapillus -102.9788 21.12558  gbif 2016-01-31
## 3 Campylorhynchus brunneicapillus -111.5015 33.67817  gbif 2016-01-01
## 4 Campylorhynchus brunneicapillus -102.9784 21.12494  gbif 2016-01-31
## 5 Campylorhynchus brunneicapillus -111.1686 32.24244  gbif 2016-01-07
## 6 Campylorhynchus brunneicapillus -109.8134 23.42220  gbif 2016-01-13
## # ... with 1 more variables: key <chr>
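One possible way to drop cross-provider duplicates from a combined table is sketched below in base R. It assumes that two rows with the same name, date, and coordinates (rounded to 4 decimal places) describe the same observation; that matching rule is an assumption, and the toy data only mimics the shape of the occ2df() output above.

```r
# Toy combined table shaped like the occ2df() output; values are made up.
obs_toy <- data.frame(
  name = "Campylorhynchus brunneicapillus",
  longitude = c(-117.1000, -117.1000, -116.9500),
  latitude = c(32.7000, 32.7000, 33.1000),
  prov = c("gbif", "bison", "gbif"),
  date = as.Date(c("2016-01-25", "2016-01-25", "2016-01-31")),
  key = c("g1", "b1", "g2"),
  stringsAsFactors = FALSE
)

# Columns used to decide whether two rows are the same observation.
# Rounding keeps small formatting differences from hiding a match.
match_cols <- data.frame(
  name = obs_toy$name,
  longitude = round(obs_toy$longitude, 4),
  latitude = round(obs_toy$latitude, 4),
  date = obs_toy$date
)

# duplicated() flags every repeat after the first row of each combination,
# so the first provider listed wins.
obs_unique <- obs_toy[!duplicated(match_cols), ]
obs_unique
```

Because the first matching row is kept, the order of the providers in the table decides which copy of a duplicate survives.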

GBIF

Global Biodiversity Information Facility
716,804,704 occurrences
1,643,948 species
31,883 datasets

install.packages('rgbif')
library(rgbif) 

Search for records by species name:

out<-occ_search(scientificName=cawr, limit=1000, hasCoordinate = TRUE, year = NULL)
out$meta ##look at metadata
## $offset
## [1] 900
## 
## $limit
## [1] 100
## 
## $endOfRecords
## [1] FALSE
## 
## $count
## [1] 113829
head(out$data)
## # A tibble: 6 × 101
##                              name        key decimalLatitude
##                             <chr>      <int>           <dbl>
## 1 Campylorhynchus brunneicapillus 1249278001        20.70048
## 2 Campylorhynchus brunneicapillus 1249286772        21.12558
## 3 Campylorhynchus brunneicapillus 1323023876        33.67817
## 4 Campylorhynchus brunneicapillus 1249286782        21.12494
## 5 Campylorhynchus brunneicapillus 1229610219        32.24244
## 6 Campylorhynchus brunneicapillus 1269556454        23.42220
## # ... with 98 more variables: decimalLongitude <dbl>, issues <chr>,
## #   datasetKey <chr>, publishingOrgKey <chr>, publishingCountry <chr>,
## #   protocol <chr>, lastCrawled <chr>, lastParsed <chr>, crawlId <int>,
## #   extensions <chr>, basisOfRecord <chr>, taxonKey <int>,
## #   kingdomKey <int>, phylumKey <int>, classKey <int>, orderKey <int>,
## #   familyKey <int>, genusKey <int>, speciesKey <int>,
## #   scientificName <chr>, kingdom <chr>, phylum <chr>, order <chr>,
## #   family <chr>, genus <chr>, species <chr>, genericName <chr>,
## #   specificEpithet <chr>, taxonRank <chr>, dateIdentified <chr>,
## #   coordinateUncertaintyInMeters <dbl>, year <int>, month <int>,
## #   day <int>, eventDate <chr>, modified <chr>, lastInterpreted <chr>,
## #   references <chr>, license <chr>, identifiers <chr>, facts <chr>,
## #   relations <chr>, geodeticDatum <chr>, class <chr>, countryCode <chr>,
## #   country <chr>, rightsHolder <chr>, identifier <chr>,
## #   verbatimEventDate <chr>, datasetName <chr>, collectionCode <chr>,
## #   gbifID <chr>, verbatimLocality <chr>, occurrenceID <chr>,
## #   taxonID <chr>, catalogNumber <chr>, recordedBy <chr>,
## #   http...unknown.org.occurrenceDetails <chr>, institutionCode <chr>,
## #   rights <chr>, eventTime <chr>, identificationID <chr>,
## #   occurrenceRemarks <chr>, informationWithheld <chr>,
## #   infraspecificEpithet <chr>, individualCount <int>,
## #   stateProvince <chr>, locality <chr>, county <chr>, sex <chr>,
## #   elevation <dbl>, elevationAccuracy <dbl>, continent <chr>,
## #   habitat <chr>, institutionID <chr>, dynamicProperties <chr>,
## #   identificationVerificationStatus <chr>, language <chr>, type <chr>,
## #   locationAccordingTo <chr>, preparations <chr>, identifiedBy <chr>,
## #   georeferencedDate <chr>, higherGeography <chr>,
## #   nomenclaturalCode <chr>, georeferencedBy <chr>,
## #   georeferenceProtocol <chr>, georeferenceVerificationStatus <chr>,
## #   endDayOfYear <chr>, verbatimCoordinateSystem <chr>,
## #   otherCatalogNumbers <chr>, organismID <chr>,
## #   previousIdentifications <chr>, identificationQualifier <chr>,
## #   samplingProtocol <chr>, accessRights <chr>,
## #   higherClassification <chr>, georeferenceSources <chr>

Search for records for a given time period & location:

##observations in the last 20 years (1997-2017)
out<-occ_search(scientificName=cawr, limit=1000, hasCoordinate = TRUE, year = '1997,2017')
df<-out$data
library(ggmap)
library(ggplot2)
#socal<-make_bbox(lon = , lat=)
#out<-occ_search(scientificName=cawr, limit=1000, hasCoordinate = TRUE, year = '1997,2017', geometry = )

library(maps)

counties <- map_data("county")%>%subset(region == "california")##shapefiles for CA counties
ggplot()+
  geom_polygon(data=counties, aes(x=long, y= lat, group=group), color='lightgrey', fill=NA)+
  geom_point(data=df, aes(x=decimalLongitude, y=decimalLatitude), shape=1, size=2, alpha=.85, color='darkblue')+
  coord_fixed(xlim = c(-123, -113.0),  ylim = c(32, 37), ratio = 1.3)+
  theme_nothing()
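The commented-out location search above is left unfinished. As a rough sketch, the ‘geometry’ argument of occ_search can take a WKT polygon, which can be built from a bounding box. The `bbox_to_wkt` helper is hypothetical, and the box simply reuses the plot limits from the map (longitude -123 to -113, latitude 32 to 37), which is an assumption rather than a vetted study area.

```r
# Hypothetical helper: build a WKT polygon from a bounding box, listing the
# corners counter-clockwise and closing the ring on the starting corner.
bbox_to_wkt <- function(min_lon, min_lat, max_lon, max_lat) {
  sprintf(
    "POLYGON((%s %s,%s %s,%s %s,%s %s,%s %s))",
    min_lon, min_lat,
    max_lon, min_lat,
    max_lon, max_lat,
    min_lon, max_lat,
    min_lon, min_lat
  )
}

# Assumed box: the same limits used in coord_fixed() for the map.
socal_wkt <- bbox_to_wkt(-123, 32, -113, 37)
socal_wkt

# Not run here (needs rgbif and a network connection):
# out <- occ_search(scientificName = cawr, limit = 1000, hasCoordinate = TRUE,
#                   year = '1997,2017', geometry = socal_wkt)
```

A rectangle this large still takes in parts of Nevada, Arizona, and Baja California, so a tighter polygon would be needed to isolate the coastal population.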

For more on the ‘spocc’ package: http://ropensci.org/tutorials/spocc_tutorial.html
Additionally, see the ‘scrubr’ package for useful functions to clean biological occurrence records quickly (remove impossible lat/lon, clean duplicates, habitat filtering, geographic errors).
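
To show what this kind of coordinate cleaning involves, here is a small base R sketch that drops incomplete, impossible, and (0, 0) coordinates. It does not use the ‘scrubr’ API; the rules and the toy records are assumptions chosen for illustration.

```r
# Toy records; values are made up so that each rule removes one row.
recs <- data.frame(
  key = c("a", "b", "c", "d"),
  decimalLongitude = c(-117.10, -200.00, 0, NA),
  decimalLatitude = c(32.70, 33.00, 0, 34.00),
  stringsAsFactors = FALSE
)

lon <- recs$decimalLongitude
lat <- recs$decimalLatitude

# Rule 1: both coordinates must be present.
complete <- !is.na(lon) & !is.na(lat)
# Rule 2: coordinates must fall within valid longitude/latitude ranges.
possible <- complete & abs(lon) <= 180 & abs(lat) <= 90
# Rule 3: (0, 0) usually marks a missing value rather than a real location.
not_zero <- complete & !(lon == 0 & lat == 0)

clean <- recs[possible & not_zero, ]
clean
```

Only the first record passes all three rules; the others fail on range, the (0, 0) check, and completeness respectively.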