Rationale
Scientific literature in certain life-science disciplines, such as taxonomy, is quite old, so such papers are available as PDFs of scans. Extracting information from such texts for use as secondary data is difficult. In taxonomy and biogeography, locality details are very important for ascertaining the distribution of species.
My aim with this work was to extract country names from such scanned PDFs and map them.
With small modifications, this pipeline can be used to retrieve any specific type of text.
## Loading required package: magick
## Loading required package: tesseract
## Loading required package: quanteda
## Loading required package: tidyverse
## Loading required package: magrittr
## Loading required package: DT
## PDF error: No display font for 'ArialUnicode'
## [1] 35
(This function is deliberately written to take a single page at a time, so it is up to the user to state which pages should be searched for country names.)
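The function itself is not shown in this write-up. A minimal sketch of single-page OCR is given here, assuming the magick package (with pdftools installed for PDF rendering) and its Tesseract binding; `ocr_page`, its arguments and the 300 dpi default are hypothetical choices, not the original code.

```r
# Sketch only: OCR one page of a scanned PDF.
# ocr_page() is a hypothetical helper, not the function used in this project.
ocr_page <- function(pdf_path, page, density = 300) {
  # Render just the requested page to an image (needs the pdftools package).
  img <- magick::image_read_pdf(pdf_path, pages = page, density = density)
  # Run Tesseract on the rendered image and return the recognised text.
  magick::image_ocr(img, language = "eng")
}
```

A call such as `ocr_page("paper.pdf", page = 12)` would then return the text of that page as a single string; the file name and page number here are placeholders.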
## Loading required package: tidytext
The result is the set of words found on the specified page. (Please note that the accuracy of the word extraction depends directly on the quality of the scan.)
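The word-extraction step can be illustrated with a small base-R stand-in. The project itself loads tidytext and quanteda for this, so the helper `count_words` and the sample sentence are assumptions made purely for illustration.

```r
# Sketch only: split OCR text into lower-case words and count them.
# count_words() is a hypothetical base-R stand-in for the tokenisation step.
count_words <- function(text) {
  # Split on anything that is not an ASCII letter (accented letters are lost).
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  # Drop empty strings and single characters, which are usually OCR noise.
  words <- words[nchar(words) > 1]
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(counts), n = as.integer(counts),
             stringsAsFactors = FALSE)
}

page_text <- "Specimens collected in India. Other specimens from Mongolia and India."
word_counts <- count_words(page_text)
```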
The words are then lemmatized, i.e. converted to their root forms.
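A minimal sketch of this step, assuming `lemmatize_words()` from the textstem package loaded below is what performs the conversion; the small `word_counts` table is made-up sample data.

```r
# Sketch only: lemmatize words and merge the counts of words sharing a lemma.
word_counts <- data.frame(
  word = c("specimens", "specimen", "collected"),
  n    = c(2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# lemmatize_words() looks each word up in a lemma dictionary
# (lexicon::hash_lemmas by default) and returns one lemma per input word.
word_counts$lemma <- textstem::lemmatize_words(word_counts$word)

# Sum the counts of all words that map to the same lemma.
lemma_counts <- aggregate(n ~ lemma, data = word_counts, FUN = sum)
```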
## Loading required package: textstem
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## Loading required package: lexicon
The filtering is run again, this time using dictionaries from the lexicon package. This is done in two steps.
The result is a set of clean, lemmatized words, each with its corresponding count.
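Which lexicon dictionaries are used in the two steps is not stated here, so the sketch below only shows the general pattern of filtering against one word list after another. `filter_words` is a hypothetical helper and both word lists are made-up stand-ins for real lexicon data sets.

```r
# Sketch only: remove words that appear in each dictionary, one pass per dictionary.
# filter_words() and both word lists are illustrative assumptions.
filter_words <- function(words, dictionaries) {
  for (dict in dictionaries) {
    words <- words[!tolower(words) %in% tolower(dict)]
  }
  words
}

step_one <- c("in", "and", "from", "other")  # stand-in for a function-word list
step_two <- c("collect", "specimen")         # stand-in for a second word list

kept <- filter_words(
  c("specimen", "collect", "in", "india", "other", "mongolia", "and"),
  list(step_one, step_two)
)
```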
## Loading required package: RDRPOSTagger
## Loading required package: rJava
## Loading required package: countrycode
## [1] "India" "Mongolia"
Two country names have been extracted from the specified page.
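The matching code is not shown in this write-up. A minimal sketch, assuming the countrycode package loaded above is used to recognise names; `find_countries` is a hypothetical helper.

```r
# Sketch only: keep the words that countrycode recognises as country names.
# find_countries() is a hypothetical helper, not the original code.
find_countries <- function(words) {
  # Words that do not match any country name pattern come back as NA.
  matched <- countrycode::countrycode(
    words,
    origin      = "country.name",
    destination = "country.name",
    warn        = FALSE
  )
  unique(matched[!is.na(matched)])
}
```

Because countrycode matches with regular expressions, ordinary words can occasionally be mistaken for countries, so the result is worth checking by eye.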
The extracted names are visualized as a word cloud using the wordcloud package. (This requires that the selected page contains country names. Again, this step can be modified to extract and visualize other sets of words, according to the user's requirements.)
## Loading required package: wordcloud
## Loading required package: RColorBrewer
If no country name is found on the page, the analysis stops here.
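The word cloud and the stopping condition can be sketched together as follows, assuming the wordcloud and RColorBrewer packages loaded above. `plot_country_cloud`, the palette and the message text are hypothetical choices.

```r
# Sketch only: draw a word cloud of country names, or stop if there are none.
# plot_country_cloud() is a hypothetical helper, not the original code.
plot_country_cloud <- function(countries, freq) {
  if (length(countries) == 0) {
    message("No country names found on this page; stopping.")
    return(invisible(FALSE))
  }
  wordcloud::wordcloud(
    words    = countries,
    freq     = freq,
    min.freq = 1,  # show names that occur only once
    colors   = RColorBrewer::brewer.pal(3, "Dark2")
  )
  invisible(TRUE)
}
```

The logical return value lets a calling script decide whether to carry on to the mapping step.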
If country names are found, the countries are mapped using the leaflet package. The GIS data for the countries are obtained in two steps (country centroids are used in this instance).
## Loading required package: rworldmap
## Loading required package: sp
## Loading required package: rgeos
## Loading required package: rworldxtra
## Loading required package: leaflet
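How the centroid table is built is not shown. One plausible reading of the two steps, sketched with the rworldmap, rgeos and sp packages loaded above, is to compute a centroid for every country polygon and then keep only the countries found in the text. `get_country_centroids` is a hypothetical helper, and the `name`, `lon` and `lat` columns are chosen to match the mapping code that follows.

```r
# Sketch only: centroids for a set of country names.
# get_country_centroids() is a hypothetical helper, not the original code.
get_country_centroids <- function(countries) {
  # Step 1: world polygons and one centroid per polygon.
  world     <- rworldmap::getMap(resolution = "low")
  centroids <- rgeos::gCentroid(world, byid = TRUE)
  xy        <- sp::coordinates(centroids)
  all_centroids <- data.frame(
    name = as.character(world$NAME),
    lon  = xy[, 1],
    lat  = xy[, 2],
    stringsAsFactors = FALSE
  )
  # Step 2: keep only the countries extracted from the page.
  all_centroids[all_centroids$name %in% countries, ]
}
```

Note that rgeos has since been retired from CRAN, so on a current R installation a function such as `sf::st_centroid()` would be the likely replacement.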
# Colour scheme for the GIS points
colour_combo <- colorFactor(c("orange", "green", "black"),
                            domain = unique(country_data_gis$name))

# Map: one circle marker per country centroid
locality_map <- leaflet(country_data_gis) %>%
  addProviderTiles("Stamen.Terrain") %>%
  addCircleMarkers(
    color = ~colour_combo(name),
    opacity = 1,
    stroke = TRUE,
    lng = ~lon,
    lat = ~lat,
    label = ~as.character(name),
    radius = 4) %>%
  addTiles()  # also adds the default OpenStreetMap tile layer
locality_map

This is an ongoing project; the next phase is to extract specific localities from the text and to highlight entire countries on the world map.