Bulk OCR Implementation - Rakha Hafish Setiawan
Activate libraries
library(tesseract)
library(magick)
library(stringr)
library(magrittr)
Get .png files
List every Portable Network Graphics (PNG) file in a folder, keeping the full file paths, and convert the result into a list. The folder used here holds the example .png files for the optical character recognition demo.
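list.files() interprets its pattern argument as a regular expression rather than a plain file extension, so it can be worth checking what a given pattern really matches before pointing it at a folder. The sketch below is only an illustration: the file names are made up and are not the ones in the demo folder. It compares an unescaped pattern with an escaped, end-anchored one.

```r
# Hypothetical file names, used only to illustrate the matching behaviour.
candidates <- c("scan_01.png", "scan_02.PNG",
                "notes_png_export.txt", "scan_03.png.bak")

# Unescaped: "." matches any character, and the match may occur anywhere.
loose <- grepl(".png", candidates)

# Escaped and anchored: a literal ".png" at the very end of the name.
strict <- grepl("\\.png$", candidates)

# Same as above, but also accepting upper-case extensions.
strict_any_case <- grepl("\\.png$", candidates, ignore.case = TRUE)

print(data.frame(candidates, loose, strict, strict_any_case))
```

list.files() also accepts ignore.case = TRUE, so the last variant should carry over if the folder mixes .png and .PNG names.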
# "\\.png$" matches a literal ".png" at the end of the file name
MyFiles = list.files(path = "C:/Users/Rakha Hafish S/Desktop/Personal Projects/New folder/OCR",
                     pattern = "\\.png$",
                     full.names = TRUE)
FileList = as.list(MyFiles)
Process individual files
Read each ingested PNG file with magick, then apply the optical character recognition function from tesseract to convert the text in each image into character strings.
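The two calls that follow assume every file can be read and recognised. As a hedged alternative for larger batches, the sketch below wraps both steps in one helper so that a single unreadable image yields NA instead of stopping the run. The name ocr_files and its read_image and recognise arguments are hypothetical; they default to the magick and tesseract functions named above and can be swapped for stand-ins when testing.

```r
# Hypothetical helper: read each image and run OCR on it, returning one
# string per file and NA for any file that fails.
ocr_files <- function(paths,
                      read_image = magick::image_read,
                      recognise = tesseract::ocr) {
  # Accept either a character vector or a list of paths.
  paths <- as.character(unlist(paths))

  results <- vapply(paths, function(path) {
    tryCatch(
      # Collapse to a single string in case OCR returns several pieces.
      paste(recognise(read_image(path)), collapse = "\n"),
      error = function(e) {
        warning("OCR failed for ", path, ": ", conditionMessage(e),
                call. = FALSE)
        NA_character_
      }
    )
  }, character(1), USE.NAMES = FALSE)

  # Label each result with its file name so failures are easy to trace.
  names(results) <- basename(paths)
  results
}
```

With the packages installed, ocr_files(MyFiles) would return a named character vector, which also makes the later unlist() step unnecessary.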
ProcessedImages = lapply(FileList, magick::image_read)
TheText = lapply(ProcessedImages, tesseract::ocr)
Perform data cleaning
After the conversion, the result is flattened into a one-dimensional character vector, and stringr removes the line breaks (\n) from each element.
unlist(TheText) %>%
  as.vector() %>%
  str_remove_all(pattern = "\n")
[1] "Recount TextPengertian, Struktur Isi, Unsur Kebahasaan, dan Contohnya"
[2] "Turn Images into Text inthree easy steps B"
[3] "This is the first line ofthis text example.This is the second lineof the same text."
[4] "It was the best oftimes, it was the worstof times, it was the ageof wisdom, it was theage of foolishness..."
[5] "“Report Text:Non-LivingThings\""
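In the output above, some words that sat on either side of a line break end up joined ("inthree", "ofthis"), because the line breaks are deleted rather than replaced. A minimal sketch of one way around this, assuming a single space is an acceptable substitute for each break, is shown below in base R. The sample strings only imitate raw OCR output, and the positions of their line breaks are guesses.

```r
# Sample strings imitating raw OCR output; where the line breaks fall is an
# assumption, not taken from the original images.
raw_text <- c(
  "Turn Images into Text in\nthree easy steps\n",
  "This is the first line of\nthis text example.\n\nThis is the second line\nof the same text.\n"
)

# Replace each run of whitespace (including "\n") with one space,
# then drop leading and trailing spaces.
clean_text <- trimws(gsub("[[:space:]]+", " ", raw_text))

print(clean_text)
```

Since stringr is already loaded in this workflow, str_squish() should give the same result in a single call.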