Rationale

Scientific literature in certain life-science streams, such as taxonomy, is quite old, and many of these papers are available only as scanned PDFs. Using information from such texts as secondary data is difficult. In taxonomy and biogeography, locality details are very important for ascertaining the distribution of species.

My aim with this work was to extract country names from such scanned PDFs and map them.

This pipeline can be modified slightly so that any specific type of text can be retrieved.

Specific libraries used for the analysis
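A minimal setup sketch, using the same if(!require()) pattern as the mapping chunk later in this post:

if(!require(magick))install.packages('magick')       # PDF page rendering
if(!require(tesseract))install.packages('tesseract') # OCR engine
if(!require(quanteda))install.packages('quanteda')   # tokenization and document-feature matrices
if(!require(tidyverse))install.packages('tidyverse') # data wrangling
if(!require(magrittr))install.packages('magrittr')   # pipes
if(!require(DT))install.packages('DT')               # interactive tables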

## Loading required package: magick
## Linking to ImageMagick 6.9.9.14
## Enabled features: cairo, freetype, fftw, ghostscript, lcms, pango, rsvg, webp
## Disabled features: fontconfig, x11
## Loading required package: tesseract
## Loading required package: quanteda
## Package version: 1.5.2
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
## Loading required package: tidyverse
## -- Attaching packages --------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
## Loading required package: DT

A sample scanned PDF used for this work.
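A minimal sketch of reading the paper with magick (the file name here is a placeholder):

# Render the scanned paper at a low resolution, enough to count its pages
pdf_file <- "sample_taxonomy_paper.pdf"   # placeholder file name
paper_pages <- magick::image_read_pdf(pdf_file, density = 72)
length(paper_pages)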

## PDF error: No display font for 'ArialUnicode'
## [1] 35

A function to get the tokens from scanned PDFs

(This function is written to take a single page at a time, so it is up to the user to state which pages should be searched for country names)
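A sketch of such a function; the conversion to a word-count table via tidytext::tidy() is an assumption, as are the file name and page number in the example call:

if(!require(tidytext))install.packages('tidytext')

# OCR one page of a scanned PDF and return a word-count table
get_page_tokens <- function(pdf_file, page){
    # Render the chosen page at high resolution for better OCR accuracy
    page_image <- magick::image_read_pdf(pdf_file, pages = page, density = 300)
    # Recognise the text on the page
    page_text <- tesseract::ocr(page_image)
    # Tokenize, dropping punctuation and numbers; keep capitalization
    # so that proper nouns can be detected later
    page_dfm <- quanteda::dfm(quanteda::tokens(page_text,
                                               remove_punct = TRUE,
                                               remove_numbers = TRUE),
                              tolower = FALSE)
    # One row per word: document, term, count
    tidytext::tidy(page_dfm)
}

page_words <- get_page_tokens("sample_taxonomy_paper.pdf", page = 5)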

## Loading required package: tidytext
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
## Warning: 'as.data.frame.dfm' is deprecated.
## Use 'convert(x, to = "data.frame")' instead.
## See help("Deprecated")

What is obtained is a set of words from the specified page (please note that the quality of the word extraction is directly proportional to the quality of the scan)

Lemmatization of the selected words

Words are converted into their roots
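A sketch of this step with textstem's lemmatize_words(), reusing the page_words table from the sketch above:

if(!require(textstem))install.packages('textstem')

# Reduce each extracted word to its root form (lemma)
page_words$lemma <- textstem::lemmatize_words(page_words$term)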

## Loading required package: textstem
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
## 
##   available.koRpus.lang()
## 
## and see ?install.koRpus.lang()
## 
## Attaching package: 'koRpus'
## The following object is masked from 'package:readr':
## 
##     tokenize
## The following objects are masked from 'package:quanteda':
## 
##     tokens, types
## Loading required package: lexicon

A filter to remove meaningless words from the data

The filtering is run again, this time using dictionaries from the lexicon package. This is done in two steps.

Generating the data required to obtain the necessary words

Obtaining the clean words
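A sketch of the two steps, assuming lexicon::grady_augmented (a large English word list that also contains proper names) as the dictionary:

# Step 1: generate the reference vocabulary of valid words
english_words <- lexicon::grady_augmented

# Step 2: keep only lemmas found in the vocabulary and tally their counts
clean_words <- page_words %>%
    dplyr::filter(tolower(lemma) %in% english_words) %>%
    dplyr::count(lemma, wt = count)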

A set of clean (and lemmatized) words is obtained, along with their corresponding counts

Extraction of names (proper nouns)
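A sketch of the extraction; RDRPOSTagger is installed from GitHub (bnosac/RDRPOSTagger) rather than CRAN, and matching the tagged proper nouns against countrycode::codelist is an assumption:

# devtools::install_github("bnosac/RDRPOSTagger") if not yet installed
library(RDRPOSTagger)
library(countrycode)

# Tag every clean word with a part-of-speech label (Penn-style tags)
tagger <- rdr_model(language = "English", annotation = "POS")
tagged <- rdr_pos(tagger, x = clean_words$lemma)

# Keep proper nouns (NNP) and intersect them with known country names
proper_nouns <- tagged$token[tagged$pos == "NNP"]
countries <- intersect(proper_nouns, countrycode::codelist$country.name.en)
countries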

## Loading required package: RDRPOSTagger
## Loading required package: rJava
## Loading required package: countrycode
## [1] "India"    "Mongolia"

From the specified page, two country names have been extracted

Wordcloud of the result

A word-cloud visualization of the above data is produced using the wordcloud package (the condition here being that the selected page contains country names; again, this can be modified to extract and visualize other sets of words as per user requirements)
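A sketch of the word-cloud call (the palette is a free choice):

if(!require(wordcloud))install.packages('wordcloud')

# Plot the clean words, sized by their counts
wordcloud::wordcloud(words = clean_words$lemma,
                     freq = clean_words$n,
                     min.freq = 1,
                     colors = RColorBrewer::brewer.pal(8, "Dark2"))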

## Loading required package: wordcloud
## Loading required package: RColorBrewer

If no country name is found on the page, the analysis stops here.

If there are names in the data, the countries are visualized using the leaflet package. The country GIS data are obtained in two steps (centroid GIS data have been used in this instance)

Obtaining and visualizing the countries

Obtaining the country GIS values
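A sketch of the two steps, using rworldmap's high-resolution map (which is what pulls in rworldxtra) and rgeos centroids:

if(!require(rworldmap))install.packages('rworldmap')
if(!require(rgeos))install.packages('rgeos')
if(!require(rworldxtra))install.packages('rworldxtra')

# Step 1: world polygons at high resolution
world_map <- rworldmap::getMap(resolution = "high")

# Step 2: one centroid point per country, as a plain data frame
centroids <- rgeos::gCentroid(world_map, byid = TRUE)
centroid_df <- as.data.frame(centroids)      # columns x (longitude) and y (latitude)
centroid_df$name <- rownames(centroid_df)    # country names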

## Loading required package: rworldmap
## Loading required package: sp
## ### Welcome to rworldmap ###
## For a short introduction type :   vignette('rworldmap')
## Loading required package: rgeos
## rgeos version: 0.5-2, (SVN revision 621)
##  GEOS runtime version: 3.6.1-CAPI-1.10.1 
##  Linking to sp version: 1.3-2 
##  Polygon checking: TRUE
## Loading required package: rworldxtra

Combining the centroid data with the country data
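A sketch of the join, building the country_data_gis object that the leaflet code below expects (the renaming to lon/lat matches its column names):

# Keep only the centroids of the countries extracted from the page
country_data_gis <- centroid_df %>%
    dplyr::filter(name %in% countries) %>%
    dplyr::rename(lon = x, lat = y)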

Data Visualization

if(!require(leaflet))install.packages('leaflet')
## Loading required package: leaflet

# Color scheme for the GIS points
colour_combo <- colorFactor(c("orange", "green", "black"),
                            domain = unique(country_data_gis$name))

# Map: terrain base tiles with one labelled circle marker per country
locality_map <- leaflet(country_data_gis) %>%
    addProviderTiles("Stamen.Terrain") %>%
    addCircleMarkers(
        color = ~colour_combo(name),
        opacity = 1,
        stroke = TRUE,
        lng = ~lon,
        lat = ~lat,
        label = ~as.character(name),
        radius = 4)

locality_map

This is an ongoing project; the next phase is to extract specific localities from the text and to highlight entire countries on the world map.