Google’s Tesseract is an optical character recognition (OCR) engine with Unicode support and the ability to recognize more than 100 languages. Tesseract works by combining numerous neural net algorithms to recognize text lines and character patterns. Now you can use this power in R with the tesseract package! (Note: Tesseract is also available in Python.)

First, install and load the packages:

install.packages("tesseract")
library(tidyverse)
library(tesseract)

The tesseract OCR engine uses language-specific training data to recognize words. The OCR algorithms are biased towards words and sentences that frequently appear together in a given language, just like the human brain. Therefore, the most accurate results are obtained when using training data for the correct language.

By default, the tesseract package only provides English training data. You can check which languages are installed (listed under `$available`) with this code:

tesseract_info()
## $datapath
## [1] "C:\\Users\\asus\\AppData\\Local\\tesseract4\\tesseract4\\tessdata/"
## 
## $available
## [1] "eng" "ind" "osd"
## 
## $version
## [1] "4.1.1"
## 
## $configs
##  [1] "alto"             "ambigs.train"     "api_config"       "bigram"          
##  [5] "box.train"        "box.train.stderr" "digits"           "get.images"      
##  [9] "hocr"             "inter"            "kannada"          "linebox"         
## [13] "logfile"          "lstm.train"       "lstmbox"          "lstmdebug"       
## [17] "makebox"          "pdf"              "quiet"            "rebox"           
## [21] "strokewidth"      "tsv"              "txt"              "unlv"            
## [25] "wordstrbox"

Let’s try it out! We want to extract text from this image, which was captured from an online book:

knitr::include_graphics("test_paper.PNG")

tesseract::ocr(image = "test_paper.PNG",
               engine = tesseract("eng"))
## [1] "Talent Engagement\nTo stay competitive, it is paramount to keep your employees fully\nengaged in order to meet and exceed your customers’ expectations\nand achieve your corporate goals. A key component to accomplishing\nthis is monitoring the engagement level of your employee population.\nWe define an engaged employee as happy, enthusiastic, and moti-\nvated, and as an individual who eagerly relishes the challenges of\nhis or her job. Analytics helps to understand the various drivers of\nemployee engagement that deliver happier, more productive work-\ners, and decrease unplanned turnover. It can also help human capital\nmanagement teams sift through data and talent information to better\n"
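
The returned string keeps the line breaks of the page, including words that were hyphenated across lines (such as "moti-" and "vated" above). Here is a minimal base R sketch for tidying this up. It assumes every hyphen directly followed by a line break is a line-break hyphen, which is not always true for genuinely hyphenated words, and the helper name `clean_ocr_text` is made up for this illustration:

```r
# Hypothetical helper (not part of the tesseract package).
clean_ocr_text <- function(x) {
  x <- gsub("-\n", "", x, fixed = TRUE)  # rejoin words split across lines
  x <- gsub("\n", " ", x, fixed = TRUE)  # turn remaining line breaks into spaces
  trimws(x)                              # drop leading/trailing whitespace
}

raw <- "happy, enthusiastic, and moti-\nvated, and as an individual\n"
clean_ocr_text(raw)
## [1] "happy, enthusiastic, and motivated, and as an individual"
```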

The extraction runs smoothly. According to the project's GitHub page, the main way to improve extraction quality is to use high-quality images. Here is an example of what happens when we provide a low-quality image:

knitr::include_graphics("odnst.png")

tesseract::ocr(image = "odnst.png",
               engine = tesseract("eng"))
## [1] "TS St /\nPU HUG Pe Hi\n"
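
Some image preprocessing before OCR may help in milder cases. The sketch below assumes the magick package is installed (it is not used elsewhere in this post). The helper name `ocr_preprocessed` and the resize width are illustrative choices rather than recommended settings, and a badly degraded image like the one above may still be unreadable:

```r
# Hypothetical helper: upscale and convert to grayscale before running OCR.
# Requires the magick and tesseract packages.
ocr_preprocessed <- function(path, language = "eng", width = 2000) {
  img <- magick::image_read(path)
  img <- magick::image_resize(img, paste0(width, "x"))   # scale to `width` pixels wide
  img <- magick::image_convert(img, type = "Grayscale")  # drop color information
  tesseract::ocr(img, engine = tesseract::tesseract(language))
}

# Usage (result not shown, since it depends on the image):
# ocr_preprocessed("odnst.png")
```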

To use another language, we need to download its training data first. The code below downloads the Indonesian (Bahasa Indonesia) training data:

tesseract_download("ind")
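
Since the download only needs to happen once, the call can be guarded with a check against the installed languages. This is a small sketch that relies on the `available` element of `tesseract_info()`; the helper name `ensure_language` is made up here:

```r
# Hypothetical helper: download training data only if it is missing.
# Requires the tesseract package.
ensure_language <- function(lang) {
  if (!lang %in% tesseract::tesseract_info()$available) {
    tesseract::tesseract_download(lang)
  }
  invisible(lang)
}

# Usage:
# ensure_language("ind")
```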

Let’s try a different language. The text is a screenshot taken from an online news portal:

knitr::include_graphics("ss2.PNG")

textt <- tesseract::ocr(image = "ss2.PNG",
                        engine = tesseract("ind")) # you only need to change the engine
cat(textt)
## tirto.id - Tak mudah bagi Abimanyu Prakarsa (33) untuk bisa menata hidup saat harus
## pindah ke Tarakan, Kalimantan Utara. Di kota baru itu, Abimanyu kesulitan mencari
## pekerjaan. Padahal dia baru saja menikah dan harus menghidupi keluarganya.
## “Karena saya baru pindah, jadi tidak punya kenalan, apalagi jaringan, untuk mencari
## pekerjaan yang stabil. Jadi selama berbulan-bulan, saya coba lihat apa yang bisa
## dilakukan untuk menciptakan peluang, tutur Abimanyu.
## 
## Ketika mencari kerja itu, Abimanyu menyadari satu hal: banyak bisnis kuliner online
## yang berkembang pesat. Ia terinspirasi untuk membuat bisnis serupa. Maka dia mulai
## cari resep, dan setelah pas, Abimanyu memulai usaha kuliner kecil-kecilan bernama
## Nasi Kota-KU. Untuk memasarkan dagangannya, Abimanyu memakai media sosial.

Lastly, the engine works poorly if the text is not horizontally aligned:

knitr::include_graphics("roc-intuition.png")

textt2 <- tesseract::ocr(image = "roc-intuition.png",
                         engine = tesseract("ind"))
cat(textt2)
## ! Akurasi,
## 1 Ketika kelas balance,
## TN | . .
## masih aman digunakan
## 1
## 0
## 0 FN 5 Fp 1
## ' Akurasi,
## ! Ketika kelas imbalance,
## SN bahaya untuk digunakan
## N
## “Hg TETEPNTNY
## D 05 FP I 1 kondisi yg tidak diinginkan,
## ketika kelas imbalance
## '
## YAN METEPLINN
## A kondisi yg diinginkan,
## FP ketika kelas imbalance

References: