Evaluating species distribution poses significant challenges due to the biogeographical complexity spanning diverse habitats. Key obstacles include accessibility to remote regions, the need for substantial resources, political issues, variations in research infrastructure, cryptic or rare species, and temporal and climatic changes. Geographical variability and ecosystem diversity necessitate collaborative approaches and innovative technologies to obtain comprehensive and representative data on a global scale.

In this context, databases storing species occurrence records, and in some cases, abundance and biomass data, provide a broader perspective, enhancing information gathered through specific research surveys. These databases are crucial for identifying areas of high diversity and establishing conservation strategies for threatened species. Additionally, they enable the assessment of evolutionary biology through biogeographic patterns, planning of agricultural and fishing activities, urban planning, and promotion of environmental education.

In the field of statistical programming with R, we have packages such as dismo, rgbif, and ntbox that enable access to valuable information from the Global Biodiversity Information Facility (GBIF). GBIF collaborates with partners offering online access to other databases like the USGS Bison database, iDigBio, and iNaturalist which records similar information. Specific resources, such as eBird, ALA, VertNet, and OBIS, can be accessed directly using the spocc package. This diversity of sources enriches data access and expands research and conservation possibilities, providing a comprehensive perspective on biodiversity distribution. This benefits both the scientific community and informed decision-making in environmental and sustainable development issues.

It is crucial to note that the information stored in these databases may contain errors, emphasizing the necessity to review the records using the appropriate management of these platforms. This involves eliminating data lacking longitude and latitude information, as well as suspicious records that do not include individual organism counts, registration codes, or the name of the database. Additionally, it is recommended to incorporate information available in grey literature, such as reports and theses, directly into the selected database for the work. Furthermore, reviewing information provided in publications focused on the species of interest, as well as the communities and environments where they have been recorded or could be found, is advised. Lastly, it is important that all biodiversity assessments include, as part of their objectives, the proper recording of species capture information in the database, aligning with the intrinsic characteristics of the species or communities being evaluated.

The following lines of R code will explore various options to download and plot this data using basic R functions, as well as ggplot2, maps, and raster packages. First, the approach that has GBIF as its main database is presented:

dismo packages

g1 = gbif('Callinectes','similis', geo = TRUE, ntries = 1, nrecs = 300)
## 59349 records found
## 0-300-600-900-1200-1500-1800-2100-2400-2700-3000-3300-3600-3900-4200-4500-4800-5100-5400-5700-6000-6300-6600-6900-7200-7500-7800-8100-8400-8700-9000-9300-9600-9900-10200-10500-10800-11100-11400-11700-12000-12300-12600-12900-13200-13500-13800-14100-14400-14700-15000-15300-15600-15900-16200-16500-16800-17100-17400-17700-18000-18300-18600-18900-19200-19500-19800-20100-20400-20700-21000-21300-21600-21900-22200-22500-22800-23100-23400-23700-24000-24300-24600-24900-25200-25500-25800-26100-26400-26700-27000-27300-27600-27900-28200-28500-28800-29100-29400-29700-30000-30300-30600-30900-31200-31500-31800-32100-32400-32700-33000-33300-33600-33900-34200-34500-34800-35100-35400-35700-36000-36300-36600-36900-37200-37500-37800-38100-38400-38700-39000-39300-39600-39900-40200-40500-40800-41100-41400-41700-42000-42300-42600-42900-43200-43500-43800-44100-44400-44700-45000-45300-45600-45900-46200-46500-46800-47100-47400-47700-48000-48300-48600-48900-49200-49500-49800-50100-50400-50700-51000-51300-51600-51900-52200-52500-52800-53100-53400-53700-54000-54300-54600-54900-55200-55500-55800-56100-56400-56700-57000-57300-57600-57900-58200-58500-58800-59100-59349 records downloaded
dim(g1)
## [1] 59349   157
head(g1[,1:5])
##   acceptedNameUsage             acceptedScientificName acceptedTaxonKey
## 1              <NA> Callinectes similis Williams, 1966          2225642
## 2              <NA> Callinectes similis Williams, 1966          2225642
## 3              <NA> Callinectes similis Williams, 1966          2225642
## 4              <NA> Callinectes similis Williams, 1966          2225642
## 5              <NA> Callinectes similis Williams, 1966          2225642
## 6              <NA> Callinectes similis Williams, 1966          2225642
##   accessRights           adm1
## 1         <NA>        Georgia
## 2         <NA>        Alabama
## 3         <NA>        Alabama
## 4         <NA>        Florida
## 5         <NA> South Carolina
## 6         <NA> North Carolina

Checking the dimensions of the data frame, downloading it later, and sorting it while removing NA values in the essential categories reveal differences between the total number of records and the number of reliable records needed to achieve this goal.

dim(g1)
## [1] 59349   157
g11 = g1[!is.na(g1$individualCount) & !is.na(g1$lon) & !is.na(g1$lat) & !is.na(g1$datasetName),]
dim(g11)
## [1] 742 157

Basic plot Callinectes similis

plot(g11$lon, g11$lat, col = "red", pch = 19, xlab = "Longitude", ylab = "Latitude", xlim = c(-100, -65), ylim = c(17, 40),
     main = expression(paste("Distribution data for ", italic("Callinectes similis"))))

legend("topright", legend = expression(italic("C. similis")),
       col = c("red"), pch = 19, box.lwd = 1)

map(add=T)
Figure 1. Distribution data for Callinectes similis available from the GBIF database

Figure 1. Distribution data for Callinectes similis available from the GBIF database

Representing distribution of Callinectes similis using ggplot2 and maps packages

map_data <- map_data("world")

MapG1 = ggplot(g11, aes(x = lon, y = lat)) +
  geom_point(aes(color = individualCount), size = 2) +
  #geom_text(aes(label = seq), nudge_x = 0.5, nudge_y = 0.1) +
  coord_cartesian(ylim=c(17,45),xlim =c(-100,-65)) +
  labs( x = "Longitude", y = "Latitude", color = "Count") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1)) +
  theme(text= element_text(size=12, family = "Arial", colour = "black"),
        axis.text.x = element_text(angle = 0,hjust = 0.5, vjust = 0.5, face = "plain", colour = "black", size = 17),
        axis.text.y = element_text(angle = 0,hjust = 0.5, vjust = 0.5, face = "plain",colour = "black", size = 17),
        axis.title.x = element_text(angle = 0,hjust = 0.5, vjust = 0.5, face = "plain",colour = "black", size = 17),
        axis.title.y = element_text(angle = 90,hjust = 0.5, vjust = 0.5, face = "plain",colour = "black", size = 17))+
  borders("world", colour = "black", size= 1) + 
 
     ggtitle(expression(paste("Distribution data for ", italic("Callinectes similis")))) +
  theme(plot.title = element_text(hjust = 0.5))

MapG1 +
  geom_polygon(data = map_data, aes(x = long, y = lat, group = group), fill = "grey", color = "black", alpha = 0.5)
Figure 2. Distribution data for Callinectes similis available from the GBIF database

Figure 2. Distribution data for Callinectes similis available from the GBIF database

rgbif package

This package provides options to retrieve data based on depth, elevation, and geographic area among others. In this context, these options are evaluated using the depth distribution of Achelous spinicarpus Stimpson, 1871 (0-521 meters) and Achelous spinimanus Latreille, 1819 (0-393 meters).

g21 = occ_data(scientificName = "Achelous spinicarpus", hasCoordinate = TRUE, limit = 30000)
g21 = data.frame(g21$data)
head(g21[,1:5])
##          key                      scientificName decimalLatitude
## 1 4430762861 Achelous spinicarpus Stimpson, 1871       26.920171
## 2 3743232619 Achelous spinicarpus Stimpson, 1871       26.939775
## 3 3947829825 Achelous spinicarpus Stimpson, 1871       28.481204
## 4 3913938976 Achelous spinicarpus Stimpson, 1871        7.128995
## 5 3913938988 Achelous spinicarpus Stimpson, 1871        7.071733
## 6 3913939118 Achelous spinicarpus Stimpson, 1871        7.183760
##   decimalLongitude                      issues
## 1        -96.53049                cdround,cudc
## 2        -84.08257                cdround,cudc
## 3        -96.42286                 cdc,cdround
## 4        -55.93917 cdround,depmms,depunl,osiic
## 5        -56.12123 cdround,depmms,depunl,osiic
## 6        -56.24220 cdround,depmms,depunl,osiic
dim(g21)
## [1] 1428  158
g211 = g21[!is.na(g21$individualCount) & !is.na(g21$decimalLongitude) & !is.na(g21$decimalLatitude) & !is.na(g21$datasetName),]
dim(g211)
## [1] 515 158
g22 = occ_data(scientificName = "Achelous spinimanus", hasCoordinate = TRUE, limit = 30000)
g22 = data.frame(g22$data)
dim(g22)
## [1] 1140  167
g221 = g22[!is.na(g22$individualCount) & !is.na(g22$decimalLongitude) & !is.na(g22$decimalLatitude) & !is.na(g22$datasetName),]
dim(g221)
## [1] 418 167

Here, we can examine the depth range of the registered data for A. spinicarpus (values in the first vector) and compare it to the recorded data for A. spinimanus (values in the second vector).

sort(unique(g211$depth))
##   [1]    0.00    0.05    2.50    3.00    3.75    5.25    6.00   14.20   15.00
##  [10]   16.00   18.00   19.60   22.50   23.00   24.00   24.72   25.00   25.10
##  [19]   26.50   26.75   27.00   27.75   28.00   30.40   31.50   32.00   33.00
##  [28]   33.50   35.00   37.00   37.50   38.00   38.40   38.50   39.00   40.00
##  [37]   41.00   41.50   42.00   43.00   44.00   45.00   45.70   46.00   47.00
##  [46]   48.00   48.50   49.00   49.40   50.00   50.20   51.00   52.00   53.00
##  [55]   53.50   54.00   55.00   55.50   56.00   56.70   57.00   57.50   58.00
##  [64]   58.50   58.60   59.00   59.50   60.00   61.00   62.20   63.00   64.00
##  [73]   65.00   65.80   66.00   67.00   68.00   69.00   69.50   70.00   71.00
##  [82]   71.30   72.00   73.00   75.00   76.20   77.00   78.00   79.00   80.50
##  [91]   82.00   85.00   86.00   88.00   89.50   90.00   91.00   91.50   92.00
## [100]   94.00   95.00   96.00   97.00  100.00  100.50  101.00  102.00  105.00
## [109]  106.00  108.00  109.50  110.00  113.00  116.00  116.50  119.00  120.00
## [118]  121.00  123.00  123.50  124.00  125.00  125.50  126.50  127.00  128.00
## [127]  137.00  139.00  146.00  152.50  155.00  159.00  161.00  164.50  165.00
## [136]  174.00  177.00  183.00  185.00  201.00  205.50  227.00  228.50  229.00
## [145]  233.50  237.50  238.00  259.00  260.50  270.00  283.00  290.00  301.50
## [154]  320.00  329.00  334.00  357.00  366.00  379.00  384.00  401.50  411.00
## [163]  412.00  415.00  421.00  425.00  457.00  457.50  549.00 3658.00
sort(unique(g221$depth))
##  [1]   0.00   0.25   0.50   1.00   2.00   5.00   6.00   6.30   7.00   7.90
## [11]   8.00   8.50   9.00   9.60  10.00  11.00  11.25  12.00  13.00  13.26
## [21]  14.17  15.00  15.20  15.50  16.00  16.50  17.70  17.83  18.00  18.10
## [31]  19.00  19.60  20.00  21.00  22.00  22.50  23.00  24.00  24.50  25.00
## [41]  25.20  26.00  27.00  27.75  29.00  30.00  30.40  31.00  31.50  32.00
## [51]  33.00  34.50  35.00  36.50  37.00  37.50  40.00  42.00  44.00  49.00
## [61]  49.50  50.20  51.00  52.00  53.00  55.00  55.50  56.00  57.00  57.50
## [71]  59.50  60.00  65.00  66.00  68.00  91.00 121.00 152.50 165.00 196.50
## [81] 293.00 320.00 366.00 393.00

Considering the available information, the recorded depth range for those two species is deemed accurate. The subsequent representation of these data utilizes basic functions and the maps package. The following maps present all the data without sorting them according to depth. This approach is justified by the shared habitat of the two species, making it more crucial to visualize the areas where both species are present.

summary(g211[,3:4])
##  decimalLatitude  decimalLongitude
##  Min.   :-32.68   Min.   :-97.74  
##  1st Qu.: 19.00   1st Qu.:-87.62  
##  Median : 24.75   Median :-82.85  
##  Mean   : 20.63   Mean   :-79.54  
##  3rd Qu.: 26.76   3rd Qu.:-76.57  
##  Max.   : 35.79   Max.   :  0.00
summary(g221[,3:4])
##  decimalLatitude  decimalLongitude
##  Min.   :-32.78   Min.   :-97.74  
##  1st Qu.: 20.95   1st Qu.:-88.04  
##  Median : 25.88   Median :-82.25  
##  Mean   : 22.63   Mean   :-79.63  
##  3rd Qu.: 30.23   3rd Qu.:-77.04  
##  Max.   : 40.47   Max.   :  0.00

Basic plot Achelous spinicarpus and Achelous spinimanus

plot(g211$decimalLongitude, g211$decimalLatitude, col = "red", pch = 16, cex = 0.5, xlab = "Longitude", ylab = "Latitude", xlim = c(-100,0), ylim = c(-32, 45), main = expression(paste("Distribution data for ", italic("A. spinicarpus")," (red) and ", italic("A. spinimanus")," (orange)")))

points(g221$decimalLongitude, g221$decimalLatitude, col = "orange", pch = 17, cex = 0.5)

legend("bottomright", legend = expression(italic("A. spinicarpus"), italic("A. spinimanus")),
       col = c("red", "orange"), pch = c(16, 17), box.lwd = 1)
map(add=T)
Figure 3. Distribution data for Achelous spinicarpus and Achelous spinimanus available from the GBIF database

Figure 3. Distribution data for Achelous spinicarpus and Achelous spinimanus available from the GBIF database

Representing distribution of Achelous spinicarpus and Achelous spinimanus using ggplot2 and maps packages.

Four genera and four species were observed due to the presence of synonymized names. The data for the species Portunus vossi Lemaitre, 1991 represents Achelous spinimanus as it is a synonymized name.

unique(g211$scientificName)
## [1] "Portunus spinicarpus (Stimpson, 1871)"
## [2] "Achelous spinicarpus Stimpson, 1871"
length(g211$scientificName)
## [1] 515
unique(g221$scientificName)
## [1] "Achelous spinimanus (Latreille, 1819)"
## [2] "Portunus spinimanus Latreille, 1819"  
## [3] "Portunus vossi Lemaitre, 1991"
length(g221$scientificName)
## [1] 418
g211$scientificName = rep("A. spinicarpus", 515)
unique(g211$scientificName)
## [1] "A. spinicarpus"
g221$scientificName = rep("A. spinimanus", 418)
unique(g221$scientificName)
## [1] "A. spinimanus"
map_data <- map_data("world")

MapG2 = ggplot() +
  geom_point(data = g211, aes(x = decimalLongitude, y = decimalLatitude, color = scientificName, size = individualCount)) +
  geom_point(data = g221, aes(x = decimalLongitude, y = decimalLatitude, color = scientificName, size = individualCount)) +
  #geom_text(aes(label = seq), nudge_x = 0.5, nudge_y = 0.1) +
  coord_cartesian(xlim = c(-100,0), ylim = c(-32, 45)) +
  labs( x = "Longitude", y = "Latitude", color = c("Species"), size = "Count") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1)) +
  theme(text= element_text(size=12, family = "Arial", colour = "black"),
        axis.text.x = element_text(angle = 0,hjust = 0.5, vjust = 0.5, face = "plain", colour = "black", size = 17),
        axis.text.y = element_text(angle = 0,hjust = 0.5, vjust = 0.5, face = "plain",colour = "black", size = 17),
        axis.title.x = element_text(angle = 0,hjust = 0.5, vjust = 0.5, face = "plain",colour = "black", size = 17),
        axis.title.y = element_text(angle = 90,hjust = 0.5, vjust = 0.5, face = "plain",colour = "black", size = 17))+
  borders("world", colour = "black", size= 1) +
 
  ggtitle(expression(paste("Distribution data for ", italic("A. spinicarpus")," and ",italic("A. spinimanus")))) +
  theme(plot.title = element_text(hjust = 0.5)) +
  
  guides(
    color = guide_legend(
      label.theme = element_text(face = "italic")),
    size = guide_legend(
      label.theme = element_text(face = "plain")))

MapG2 +
 geom_polygon(data = map_data, aes(x = long, y = lat, group = group), fill = "grey", color = "black", alpha = 0.5)
Figure 4. Distribution data for Achelous spinicarpus and Achelous spinimanus available from the GBIF database

Figure 4. Distribution data for Achelous spinicarpus and Achelous spinimanus available from the GBIF database

ntbox package

This package features a graphical user interface; nevertheless, in this tutorial, we exclusively employ the function search_gbif_data for obtaining distribution data, but leafletplot parameter is used. Additionally, the package offers functions to retrieve environmental variables (e.g., WorldClim).

g3 = search_gbif_data("Phyllospongia","foliascens", occlim = 10000, writeFile = FALSE, leafletplot = FALSE, showClusters = FALSE)
head(g3[,1:5])
##                                      name longitude   latitude      issues prov
## 1 Phyllospongia foliascens (Pallas, 1766)  125.6085  -8.522772 cdc,cdround gbif
## 2 Phyllospongia foliascens (Pallas, 1766)  177.3317 -16.949136 cdc,cdround gbif
## 3 Phyllospongia foliascens (Pallas, 1766)  177.3323 -16.943435 cdc,cdround gbif
## 4 Phyllospongia foliascens (Pallas, 1766)  120.8912  13.686359 cdc,cdround gbif
## 5 Phyllospongia foliascens (Pallas, 1766)  120.8912  13.686359 cdc,cdround gbif
## 6 Phyllospongia foliascens (Pallas, 1766)  120.8912  13.686359 cdc,cdround gbif
mg3 = leaflet::leaflet(g3)
mg31 <- mg3 %>% leaflet::addTiles()
mg32 = mg31 %>% leaflet::addCircleMarkers(lng = ~ longitude,
                                         lat = ~latitude,
                                         fillOpacity = 0.25, 
                                          radius = 2,col="firebrick")
mg32

Figure 5. Distribution data of Phyllospongia foliascens available from the GBIF database

dim(g3)
## [1] 337 154
g31 = g3[!is.na(g3$individualCount) & !is.na(g3$longitude) & !is.na(g3$latitude) & !is.na(g3$datasetName),]
dim(g31)
## [1]   2 154

According to the database, Phyllospongia foliascens Pallas, 1766, and considering the four columns, the majority of records are not reliable.

intSDM package

Similar to rgbif and ntbox, it is possible to select a geographic area to download the data. In this case, it was used a shapefile of the Caribbean Island/Island Group Management Areas Map & GIS Data available through the NOAA FISHERIES website NOAA FISHERIES website. However, at the time of this tutorial, the function was not recognized by R, despite uninstalling and reinstalling the package, which may or may not be due to an issue on my end.

shp = sf::st_read(dsn ="C:/shapefile/CaribbeanManagementAreas_po.shp")
## Reading layer `CaribbeanManagementAreas_po' from data source 
##   `C:\shapefile\CaribbeanManagementAreas_po.shp' using driver `ESRI Shapefile'
## Simple feature collection with 3 features and 2 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -68.48189 ymin: 14.92819 xmax: -63.88889 ymax: 21.85425
## Geodetic CRS:  NAD83
plot(shp)
Figure 6. Shapefile of Puerto Rico, St. Thomas, St. John, and St. Croix islands

Figure 6. Shapefile of Puerto Rico, St. Thomas, St. John, and St. Croix islands

#g4 = obtainGBIF("Dendrogyra cylindrus", shp)

spocc package

This package offers the option to select the study area and database for obtaining species distribution data. Here, all databases are presented; however, it’s important to note that the focus is on sponge species XD. It’s worth mentioning that eBird and VertNet data may not provide relevant information as they are specialized for vertebrate species.

g4 = occ(query = "Dendrogyra cylindrus", from =  c('gbif', 'bison', 'inat', 'ebird', 'ecoengine','vertnet', 'obis'), 
    limit = 10000, geometry = shp)
g4
## Searched: gbif, bison, inat, ebird, ecoengine, vertnet, obis
## Occurrences - Found: 80, Returned: 80
## Search type: Scientific
##   inat: Dendrogyra cylindrus (79)
##   vertnet: Dendrogyra cylindrus (1)
## 
## Note: spocc cannot estimate complete additional records found as none available for ebird

To obtain all information with this package, it is necessary to use the functions data.frame and occ2df. Additionally, acquiring data for this species from the VertNet database is uncommon. This package provides data with different column names than those used above.

g4 = occ2df(g4)
head(g4)
## # A tibble: 6 × 6
##   name                 longitude      latitude      prov  date       key      
##   <chr>                <chr>          <chr>         <chr> <date>     <chr>    
## 1 Dendrogyra cylindrus -64.9017938019 18.3617737391 inat  2015-08-01 191491078
## 2 Dendrogyra cylindrus -64.9476047061 18.346176467  inat  2016-02-13 187244096
## 3 Dendrogyra cylindrus -64.8867011932 18.2333751163 inat  2015-12-06 185014583
## 4 Dendrogyra cylindrus -64.8995158985 18.3018340423 inat  2015-12-06 185014063
## 5 Dendrogyra cylindrus -64.9314428336 18.3330347611 inat  2015-11-27 184737334
## 6 Dendrogyra cylindrus -64.8065847244 18.3507824507 inat  2015-10-18 183108013
dim(g4)
## [1] 80  6
g4$longitude = as.numeric(g4$longitude)
g4$latitude = as.numeric(g4$latitude)
g41 = g4[!is.na(g4$longitude) & !is.na(g4$latitude) & !is.na(g4$prov),]
dim(g41)
## [1] 79  6
summary(g41[,2:3])
##    longitude         latitude    
##  Min.   :-68.38   Min.   :17.66  
##  1st Qu.:-64.96   1st Qu.:18.24  
##  Median :-64.88   Median :18.30  
##  Mean   :-65.05   Mean   :18.30  
##  3rd Qu.:-64.75   3rd Qu.:18.35  
##  Max.   :-64.47   Max.   :18.68

Basic plot Dendrogyra cylindrus

plot(g41$longitude, g41$latitude, col = "limegreen", pch = 19, xlab = "Longitude", ylab = "Latitude", xlim = c(-70, -58), ylim = c(19, 15),
     main = expression(paste("Distribution data for ", italic("Dendrogyra cylindrus"))))

legend("topright", legend = expression(italic("D. cylindrus")),
       col = c("limegreen"), pch = 19, box.lwd = 1)

map(add=T)

Representing distribution of Dendrogyra cylindrus using ggplot2, sf, and maps packages

map_data <- map_data("world")

MapG3 = ggplot() +
  geom_point(data = g41, aes(x = longitude, y = latitude, color = name)) +
      coord_cartesian(xlim = c(-70, -58), ylim = c(19, 15)) +
  labs( x = "Longitude", y = "Latitude", color = c("Species"), size = "Conteo") +
  theme_classic() +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1)) +
  theme(text= element_text(size=12, family = "Arial", colour = "black"),
        axis.text.x = element_text(angle = 0,hjust = 0.5, vjust = 0.5, face = "plain", colour = "black", size = 17),
        axis.text.y = element_text(angle = 0,hjust = 0.5, vjust = 0.5, face = "plain",colour = "black", size = 17),
        axis.title.x = element_text(angle = 0,hjust = 0.5, vjust = 0.5, face = "plain",colour = "black", size = 17),
        axis.title.y = element_text(angle = 90,hjust = 0.5, vjust = 0.5, face = "plain",colour = "black", size = 17))+
  borders("world", colour = "black", size= 1) +
  ggtitle(expression(paste("Distribution data for ", italic("Dendrogyra cylindrus")))) +
  theme(plot.title = element_text(hjust = 0.5)) + theme(legend.text = element_text(face = "italic"))

MapG3 +
  geom_polygon(data = map_data, aes(x = long, y = lat, group = group), fill = "grey", color = "black", alpha = 0.5)
Figure 8. Distribution data for Dendrogyra cylindrus available from the inaturalist and vernet databases

Figure 8. Distribution data for Dendrogyra cylindrus available from the inaturalist and vernet databases

Using Python

There are various methods available for retrieving biodiversity information in Python, and it’s also possible to accomplish this in R by employing an approach related to web scraping. Notably, in both Python and R, there exist specialized packages to facilitate this process. In the case of Python, there are at least two notable packages designed for this purpose. In this tutorial, we will explore a specific method utilizing a web scraping function along with the functions provided by the pygbif package. pygbif is a valuable tool that interfaces with the Global Biodiversity Information Facility (GBIF) API, offering a seamless and efficient means of obtaining species occurrence data. By incorporating a web scraping function into the workflow, users can further enhance their ability to gather comprehensive biodiversity information for their projects. This tutorial will provide a step-by-step guide on implementing this combined approach for effective data retrieval

import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pygbif import occurrences

import geopandas as gpd

Web Scraping Function Approach

import requests

def get_data(taxon_name):
    url = f'https://api.inaturalist.org/v1/observations?taxon_name={taxon_name}&quality_grade=research&per_page=100'
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()
    else:
        print(f'Error fetching data: {response.status_code}')
# Data 
taxon_name = 'Panulirus argus'
sp_data = get_data(taxon_name)
type(sp_data)
## <class 'dict'>

Converting dictionary to data frame

FuncDat = pd.json_normalize(sp_data['results'])
type(FuncDat)
## <class 'pandas.core.frame.DataFrame'>
FuncDat.head()
##   quality_grade  ... user.preferences.prefers_project_addition_by
## 0      research  ...                                          NaN
## 1      research  ...                                          NaN
## 2      research  ...                                          NaN
## 3      research  ...                                          NaN
## 4      research  ...                                          NaN
## 
## [5 rows x 136 columns]

Checking columns

NCOL = FuncDat.columns
NCOL1 = list(NCOL)
NCOL1.sort()
print(NCOL1)
## ['annotations', 'cached_votes_total', 'captive', 'comments', 'comments_count', 'community_taxon_id', 'created_at', 'created_at_details.date', 'created_at_details.day', 'created_at_details.hour', 'created_at_details.month', 'created_at_details.week', 'created_at_details.year', 'created_time_zone', 'description', 'faves', 'faves_count', 'flags', 'geojson.coordinates', 'geojson.type', 'geoprivacy', 'id', 'ident_taxon_ids', 'identifications', 'identifications_count', 'identifications_most_agree', 'identifications_most_disagree', 'identifications_some_agree', 'license_code', 'location', 'map_scale', 'mappable', 'non_owner_ids', 'num_identification_agreements', 'num_identification_disagreements', 'oauth_application_id', 'obscured', 'observation_photos', 'observed_on', 'observed_on_details.date', 'observed_on_details.day', 'observed_on_details.hour', 'observed_on_details.month', 'observed_on_details.week', 'observed_on_details.year', 'observed_on_string', 'observed_time_zone', 'ofvs', 'outlinks', 'owners_identification_from_vision', 'photos', 'place_guess', 'place_ids', 'positional_accuracy', 'preferences.prefers_community_taxon', 'project_ids', 'project_ids_with_curator_id', 'project_ids_without_curator_id', 'project_observations', 'public_positional_accuracy', 'quality_grade', 'quality_metrics', 'reviewed_by', 'site_id', 'sounds', 'spam', 'species_guess', 'tags', 'taxon.ancestor_ids', 'taxon.ancestry', 'taxon.atlas_id', 'taxon.complete_species_count', 'taxon.created_at', 'taxon.current_synonymous_taxon_ids', 'taxon.default_photo.attribution', 'taxon.default_photo.flags', 'taxon.default_photo.id', 'taxon.default_photo.license_code', 'taxon.default_photo.medium_url', 'taxon.default_photo.original_dimensions.height', 'taxon.default_photo.original_dimensions.width', 'taxon.default_photo.square_url', 'taxon.default_photo.url', 'taxon.endemic', 'taxon.extinct', 'taxon.flag_counts.resolved', 'taxon.flag_counts.unresolved', 'taxon.iconic_taxon_id', 'taxon.iconic_taxon_name', 'taxon.id', 'taxon.introduced', 'taxon.is_active', 'taxon.min_species_ancestry', 'taxon.min_species_taxon_id', 'taxon.name', 'taxon.native', 'taxon.observations_count', 'taxon.parent_id', 'taxon.photos_locked', 'taxon.preferred_common_name', 'taxon.rank', 'taxon.rank_level', 'taxon.taxon_changes_count', 'taxon.taxon_schemes_count', 'taxon.threatened', 'taxon.universal_search_rank', 'taxon.wikipedia_url', 'taxon_geoprivacy', 'time_observed_at', 'time_zone_offset', 'updated_at', 'uri', 'user.activity_count', 'user.created_at', 'user.icon', 'user.icon_url', 'user.id', 'user.identifications_count', 'user.journal_posts_count', 'user.login', 'user.login_autocomplete', 'user.login_exact', 'user.name', 'user.name_autocomplete', 'user.observations_count', 'user.orcid', 'user.preferences.prefers_observation_fields_by', 'user.preferences.prefers_project_addition_by', 'user.roles', 'user.site_id', 'user.spam', 'user.species_count', 'user.suspended', 'user.universal_search_rank', 'uuid', 'votes']

Obtaining count data from a database

PAcounts = FuncDat['identifications_count']
type(PAcounts)
## <class 'pandas.core.series.Series'>

Obtaining coordinates data from a database

PAcoor = FuncDat['geojson.coordinates']
PAcoor[0:4]
## 0     [-86.742391102, 20.9353787156]
## 1    [-86.8031357601, 20.9219580775]
## 2          [-86.4329452, 16.3593807]
## 3     [-61.7928594014, 12.007440603]
## Name: geojson.coordinates, dtype: object

Extracting longitude data

Longitude = [x[0] for x in PAcoor]
type(Longitude)
## <class 'list'>
Longitude[0:4]
## [-86.742391102, -86.8031357601, -86.4329452, -61.7928594014]

Extracting latitude data

Latitude = [y[1] for y in PAcoor]
type(Latitude)
## <class 'list'>
Latitude[0:4]
## [20.9353787156, 20.9219580775, 16.3593807, 12.007440603]

Preparing to merge data.frame data and list data to create a data frame

PAdat = {'Longitude': Longitude, 'Latitude': Latitude, 'Counts': PAcounts}

Creating data frame

PAdf = pd.DataFrame(PAdat)

Removing NA values from the data frame

PAdf = PAdf.dropna()
PAdf.shape
## (100, 3)

Final data frame

PAdf.head()
##    Longitude   Latitude  Counts
## 0 -86.742391  20.935379       1
## 1 -86.803136  20.921958       1
## 2 -86.432945  16.359381       1
## 3 -61.792859  12.007441       1
## 4 -87.026111  20.343889       2
naturalearth_lowres = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

fig, ax = plt.subplots(figsize=(10, 6))

naturalearth_lowres.plot(ax=ax, color='lightgray', edgecolor='black')
ax.set_xlim([-120, 50])
## (-120.0, 50.0)
ax.set_ylim([-50,60])
## (-50.0, 60.0)
plt.scatter(PAdf['Longitude'], PAdf['Latitude'], color = 'orange', s = 20)
plt.legend(['counts'],  loc='lower right', bbox_to_anchor=(1, 0))
ax.set_xlabel('Longitude', color = 'black', size = 11)
ax.set_ylabel('Latitude', color = 'black', size = 11)

ax.tick_params(axis = 'x', colors = 'black', labelsize = 15)
ax.tick_params(axis = 'y', colors = 'black', labelsize = 15)

title = r'$\mathrm{Distribution}$ $\mathrm{for}$ $\it{Panulirus}$ $\it{argus}$'
ax.set_title(title, color = 'black', size = 15)
plt.show()
Figure 9. Distribution data for Panulirus argus available from the inaturalist database

Figure 9. Distribution data for Panulirus argus available from the inaturalist database

pygbif package

occ_data = occurrences.search(scientificName = 'Periclimenes iridescens')
type(occ_data)
## <class 'dict'>
occ_df = pd.json_normalize(occ_data['results'])
type(occ_df)
## <class 'pandas.core.frame.DataFrame'>
NCOL = occ_df.columns
NCOL2 = list(NCOL)
NCOL2.sort()
print(NCOL2)
## ['acceptedNameUsage', 'acceptedScientificName', 'acceptedTaxonKey', 'accessRights', 'associatedReferences', 'basisOfRecord', 'bibliographicCitation', 'catalogNumber', 'class', 'classKey', 'collectionCode', 'collectionID', 'collectionKey', 'continent', 'coordinateUncertaintyInMeters', 'country', 'countryCode', 'county', 'crawlId', 'dataGeneralizations', 'datasetID', 'datasetKey', 'datasetName', 'dateIdentified', 'day', 'decimalLatitude', 'decimalLongitude', 'depth', 'depthAccuracy', 'disposition', 'distanceFromCentroidInMeters', 'dynamicProperties', 'elevation', 'elevationAccuracy', 'endDayOfYear', 'establishmentMeans', 'eventDate', 'eventID', 'eventRemarks', 'eventTime', 'extensions.http://rs.gbif.org/terms/1.0/Multimedia', 'extensions.http://rs.tdwg.org/ac/terms/Multimedia', 'extensions.http://rs.tdwg.org/dwc/terms/Identification', 'facts', 'family', 'familyKey', 'fieldNumber', 'footprintWKT', 'gadm.level0.gid', 'gadm.level0.name', 'gadm.level1.gid', 'gadm.level1.name', 'gadm.level2.gid', 'gadm.level2.name', 'gadm.level3.gid', 'gadm.level3.name', 'gbifID', 'genericName', 'genus', 'genusKey', 'geodeticDatum', 'georeferenceProtocol', 'georeferenceSources', 'georeferenceVerificationStatus', 'georeferencedBy', 'georeferencedDate', 'habitat', 'higherClassification', 'higherGeography', 'higherGeographyID', 'hostingOrganizationKey', 'http://unknown.org/language', 'http://unknown.org/orders', 'identificationID', 'identificationQualifier', 'identificationRemarks', 'identificationVerificationStatus', 'identifiedBy', 'identifiedByIDs', 'identifier', 'identifiers', 'individualCount', 'installationKey', 'institutionCode', 'institutionID', 'institutionKey', 'isInCluster', 'island', 'islandGroup', 'issues', 'iucnRedListCategory', 'key', 'kingdom', 'kingdomKey', 'language', 'lastCrawled', 'lastInterpreted', 'lastParsed', 'license', 'lifeStage', 'locality', 'locationAccordingTo', 'locationID', 'locationRemarks', 'materialSampleID', 'media', 'modified', 'month', 'municipality', 'nameAccordingTo', 'namePublishedInYear', 'networkKeys', 'nomenclaturalCode', 'occurrenceID', 'occurrenceRemarks', 'occurrenceStatus', 'order', 'orderKey', 'organismID', 'originalNameUsage', 'otherCatalogNumbers', 'ownerInstitutionCode', 'parentNameUsage', 'phylum', 'phylumKey', 'preparations', 'previousIdentifications', 'programmeAcronym', 'projectId', 'protocol', 'publishingCountry', 'publishingOrgKey', 'recordedBy', 'recordedByIDs', 'references', 'relations', 'rightsHolder', 'sampleSizeUnit', 'samplingProtocol', 'scientificName', 'scientificNameID', 'sex', 'species', 'speciesKey', 'specificEpithet', 'startDayOfYear', 'stateProvince', 'taxonID', 'taxonKey', 'taxonRank', 'taxonRemarks', 'taxonomicStatus', 'type', 'typeStatus', 'verbatimCoordinateSystem', 'verbatimDepth', 'verbatimEventDate', 'verbatimLocality', 'verbatimSRS', 'vernacularName', 'waterBody', 'year']
Fcol = ['decimalLongitude','decimalLatitude','occurrenceStatus','datasetName']
type(Fcol)
## <class 'list'>
Focc_df = occ_df[Fcol]
Focc_df.shape
## (224, 4)
Focc_df.head()
##    decimalLongitude  ...                                        datasetName
## 0        -68.969640  ...                                                NaN
## 1        -89.700444  ...  Actualización del conocimiento de la diversida...
## 2        -89.714944  ...  Actualización del conocimiento de la diversida...
## 3        -89.714944  ...                                                NaN
## 4        -89.700444  ...                                                NaN
## 
## [5 rows x 4 columns]
Focc_df.columns
## Index(['decimalLongitude', 'decimalLatitude', 'occurrenceStatus',
##        'datasetName'],
##       dtype='object')
type(Focc_df)
## <class 'pandas.core.frame.DataFrame'>
Focc_df['Values'] = 1
Focc_df['datasetName'].unique()
## array([nan,
##        'Actualización del conocimiento de la diversidad de especies de invertebrados marinos bentónicos de aguas someras (<50m) del Sur del Golfo de México',
##        'NOAA National Benthic Inventory',
##        'Digitalización y Sistematización de las Colecciones Biológicas Nacionales del Instituto de Biología, UNAM',
##        'Colección de Artropodos del Museo de Historia Natural Marina de Colombia - Makuriwa',
##        'CUBAGUA_CRUS',
##        'Crustáceos estomatópodos, anfípodos, isópodos y decápodos del litoral de Quintana Roo',
##        'Computarización de la Colección Nacional de Crustáceos del Instituto de Biología, UNAM y elaboración de su catálogo',
##        'NMNH Extant Biology'], dtype=object)
Focc_df = Focc_df.dropna()
Focc_df.shape
## (80, 5)
Focc_df.head()
##     decimalLongitude  ...  Values
## 1         -89.700444  ...       1
## 2         -89.714944  ...       1
## 12        -89.682272  ...       1
## 14        -89.647583  ...       1
## 15        -90.283028  ...       1
## 
## [5 rows x 5 columns]
# Crear la gráfica
naturalearth_lowres = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

fig, ax = plt.subplots(figsize=(10, 6))

naturalearth_lowres.plot(ax=ax, color='lightgray', edgecolor='black')
ax.set_xlim([-120, 50])
## (-120.0, 50.0)
ax.set_ylim([-50,60])
## (-50.0, 60.0)
plt.scatter(Focc_df['decimalLongitude'], Focc_df['decimalLatitude'], color = 'black', s = 20)
plt.legend(['counts'],  loc='lower right', bbox_to_anchor=(1, 0))
ax.set_xlabel('Longitude', color = 'black', size = 11)
ax.set_ylabel('Latitude', color = 'black', size = 11)

ax.tick_params(axis = 'x', colors = 'black', labelsize = 15)
ax.tick_params(axis = 'y', colors = 'black', labelsize = 15)

title = r'$\mathrm{Distribution}$ $\mathrm{for}$ $\it{Periclimenes}$ $\it{iridescens}$'
ax.set_title(title, color = 'black', size = 15)
plt.show()
Figure 10. Distribution data for Periclimenes iridescens available from the GBIF database

Figure 10. Distribution data for Periclimenes iridescens available from the GBIF database