Bulk OCR Implementation - Rakha Hafish Setiawan

Activate libraries

library(tesseract)
library(magick)
library(stringr)
library(magrittr)

Get .png files

List every Portable Network Graphics (PNG) file in a folder and convert the result into a list, keeping each file's full path. The .png files below are the examples used for the optical character recognition demo.

# `pattern` is a regular expression: escape the dot and anchor it to the end of the name.
MyFiles = list.files(path = "C://Users//Rakha Hafish S//Desktop//Personal Projects//New folder//OCR", 
                     pattern = "\\.png$", 
                     full.names = TRUE)
FileList = as.list(MyFiles)
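
Because list.files() interprets pattern as a regular expression, an unescaped dot matches any character and an unanchored pattern can match anywhere in the name. The sketch below illustrates the difference in a temporary folder; the folder name and file names are hypothetical and exist only for this demonstration.

```r
# Create a throwaway folder with hypothetical file names (not the demo images).
demo_dir = file.path(tempdir(), "ocr_pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  demo_dir,
  c("page1.png", "page2.png", "notes_png.txt", "scan.png.bak")
)))

# Unescaped, unanchored pattern: "." matches any character, anywhere in the name.
loose = list.files(demo_dir, pattern = ".png")

# Escaped dot, anchored to the end of the name: only true .png files match.
strict = list.files(demo_dir, pattern = "\\.png$")

print(loose)
print(strict)
```

If some files may carry an upper-case extension such as .PNG, list.files() also accepts ignore.case = TRUE.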

Process individual files

Read each ingested PNG file with magick, then apply the optical character recognition function from tesseract to convert the image-based text into character strings.

ProcessedImages = lapply(FileList, magick::image_read)
TheText = lapply(ProcessedImages, tesseract::ocr)
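
In a bulk run, a single unreadable file stops lapply() with an error. One possible way to keep the batch going is to wrap the per-file step in tryCatch(). This is a minimal sketch, not part of the original workflow: safe_apply and fake_ocr are hypothetical names, and fake_ocr is a stand-in so the example runs without any image files.

```r
# Hypothetical helper: apply `step` to each path and return NA for files that fail.
safe_apply = function(paths, step) {
  out = lapply(paths, function(p) {
    tryCatch(
      step(p),
      error = function(e) {
        warning("Skipping ", p, ": ", conditionMessage(e), call. = FALSE)
        NA_character_
      }
    )
  })
  # Label each result with its file name so failures are easy to trace.
  names(out) = basename(unlist(paths))
  out
}

# Stand-in for the real step (assumption). With the packages loaded above,
# the step would be: function(p) tesseract::ocr(magick::image_read(p))
fake_ocr = function(p) {
  if (grepl("broken", p)) stop("cannot read image")
  paste("text from", basename(p))
}

result = suppressWarnings(
  safe_apply(c("demo/page1.png", "demo/broken.png"), fake_ocr)
)
print(result)
```

Naming the results by file also makes it possible to match each recognized text back to its source image after unlist().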

Perform data cleaning

After conversion, the result is flattened into a one-dimensional character vector, and the line breaks (\n) are removed from each element with stringr.

unlist(TheText) %>% 
  as.vector() %>% 
  str_remove_all(pattern = "\n")
[1] "Recount TextPengertian, Struktur Isi, Unsur Kebahasaan, dan Contohnya"                                      
[2] "Turn Images into Text inthree easy steps B"                                                                 
[3] "This is the first line ofthis text example.This is the second lineof the same text."                        
[4] "It was the best oftimes, it was the worstof times, it was the ageof wisdom, it was theage of foolishness..."
[5] "“Report Text:Non-LivingThings\""
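
In the output above, words on either side of a removed line break are joined (for example "inthree" and "ofthis"). A possible variation, sketched here on hypothetical input strings rather than the real OCR results, is to replace each line break with a space and then collapse the extra whitespace with str_squish().

```r
library(stringr)

# Hypothetical OCR-style strings: lines separated by "\n", with a trailing line break.
raw_text = c(
  "This is the first line of\nthis text example.\n",
  "Turn Images into Text in\nthree easy steps\n"
)

# Removing the line breaks joins the neighbouring words.
removed = str_remove_all(raw_text, pattern = "\n")

# Replacing them with a space keeps the words apart; str_squish() trims the
# ends and collapses repeated whitespace.
spaced = str_squish(str_replace_all(raw_text, pattern = "\n", replacement = " "))

print(removed)
print(spaced)
```

Whether a space is the right replacement depends on the source images; text that is hyphenated across lines would need separate handling.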