Objectives

The Tesseract

The tesseract package provides R bindings to Tesseract, a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable, so its detection algorithms can be tuned to obtain the best possible results. For more details, see https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html.

Loading the libraries

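library(tesseract) # OCR engine bindings
library(magick)    # image preprocessing
library(ggplot2)   # ggplot(), used for the confidence plot
library(dplyr)     # filter(), used to pick the best result
library(plotly)    # ggplotly(), interactive version of the plot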

Creating an engine

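# Download the Spanish training data if it is not installed yet
if (!("spa" %in% tesseract_info()$available))
  tesseract_download("spa")

# Alternative (disabled): Spanish engine restricted to a character whitelist
# tesseract_engine <- tesseract(language = 'spa', datapath = NULL, configs = NULL, cache = TRUE,
#   options = list(tessedit_char_whitelist = "abcdefghijklmnñopqrstuvwxyzABCDEFGHIJKLMNÑOPQRSTUVWXYZ0123456789 \n!,"))

# No language is passed here, so the engine uses the default English ("eng") model
tesseract_engine <- tesseract(datapath = NULL, configs = NULL, cache = TRUE)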
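
The commented-out configuration above can be turned into a runnable sketch. It assumes the Spanish training data can be downloaded and that restricting recognition to a small character set suits this kind of image. The name `spanish_engine` is hypothetical and is not used anywhere else in this post.

```r
library(tesseract)

# Characters the engine is allowed to return (an assumption about the meme's text).
whitelist <- paste(c(letters, "ñ", LETTERS, "Ñ", 0:9, " ", "!", ","), collapse = "")

# Download the Spanish model only when it is not installed yet.
if (!("spa" %in% tesseract_info()$available)) {
  tesseract_download("spa")
}

# `spanish_engine` is a hypothetical name used only for illustration.
spanish_engine <- tesseract(
  language = "spa",
  options  = list(tessedit_char_whitelist = whitelist)
)
```

Passing `engine = spanish_engine` to `ocr()` would then use the Spanish model instead of the default one. Whether that improves the result for a given image has to be checked case by case.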

Let’s try to extract the text from this meme

image_file <- "memes/covid3.JPG" 
text <- ocr( image_file, engine = tesseract_engine )
writeLines( text )
. eo
A
, A

The accuracy of the OCR process depends on the quality of the input image. You can often improve results by properly scaling the image, removing noise and artifacts, or cropping the area where the text exists. See the "Improve quality" page of the Tesseract wiki for important tips on improving the quality of your input image.

The awesome magick R package has many useful functions that can be used to enhance the quality of the image.
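
Scaling, noise removal and cropping each map onto a magick function. The following is a minimal sketch, assuming a synthetic image (a black block on a white canvas) as a stand-in for a real photo. The names `canvas`, `block`, `img` and `cleaned` are hypothetical, and the parameter values are illustrative rather than tuned.

```r
library(magick)

# Hypothetical stand-in for a real photo: a white canvas with a wide margin
# around a dark block that plays the role of the text area.
canvas <- image_blank(width = 300, height = 100, color = "white")
block  <- image_blank(width = 100, height = 40, color = "black")
img    <- image_composite(canvas, block, offset = "+100+30")

cleaned <- img %>%
  image_resize("900x") %>%              # scale to 900 px wide, keeping the aspect ratio
  image_convert(type = "Grayscale") %>% # drop colour information
  image_despeckle() %>%                 # remove small speckle noise
  image_trim(fuzz = 10)                 # crop away the near-uniform border

# The trimmed image is smaller than the 900 x 300 resized canvas.
image_info(cleaned)[, c("width", "height")]
```

The order of the steps is a design choice: resizing first gives the later filters more pixels to work with, and trimming last removes the margin left around the text.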

Let’s load the image

image_original <- image_read( image_file )

Preprocessing the image

Let’s resize the image to a range of widths, apply the same clean-up steps each time, and store each recognized text together with the mean word confidence reported by Tesseract.

# Containers for the results of each trial
x <- c()
y <- c()
index <- c()
text <- c()
# Candidate widths in pixels: 100, 200, ..., 4000
sizes <- seq(from = 100, to = 4000, by = 100)
for (i in seq_along(sizes)) {
  # Geometry string such as "100x": resize to this width, keep the aspect ratio
  size <- paste0(as.character(sizes[i]), 'x')
  image_magick <- image_original %>%
    image_resize(size) %>%
    image_convert(type = 'Grayscale') %>%
    image_trim(fuzz = 30) %>%
    image_enhance()

  # Recognized text and per-word confidence for this width
  text[i] <- tesseract::ocr(image = image_magick, engine = tesseract_engine)
  ocr.data <- tesseract::ocr_data(image_magick, engine = tesseract_engine)

  index[i] <- i
  x[i] <- sizes[i]
  y[i] <- mean(ocr.data$confidence, na.rm = TRUE)
}
data.points <- data.frame(index = index, x = x, y = y, text = text)
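
The score stored in `y` is simply the average of the per-word confidences. The sketch below shows how that average behaves, using a hypothetical data frame `fake_ocr` that imitates the `word` and `confidence` columns of `ocr_data()`. The values are invented for illustration and are not real OCR results.

```r
# Hypothetical stand-in for the data frame returned by ocr_data():
# one row per recognized word, with a confidence between 0 and 100.
fake_ocr <- data.frame(
  word       = c("EI", "2020", "lo", "descargaron", "?"),
  confidence = c(91.2, 96.5, 88.0, 93.1, NA)
)

# A single missing value makes the plain mean NA.
mean(fake_ocr$confidence)

# With na.rm = TRUE the missing value is dropped before averaging.
mean(fake_ocr$confidence, na.rm = TRUE)

# If no words are recognized at all, the mean of an empty vector is NaN.
mean(numeric(0), na.rm = TRUE)
```

A mean treats every word equally, so one long, well-recognized word counts as much as a stray symbol. That is a simplification worth keeping in mind when comparing sizes.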

The Plot

Let’s plot the mean confidence against the image width

g <- ggplot( data= data.points, aes(x=x, y=y) ) + 
  geom_point( color='green' ) + 
  geom_line(size=1, color='green')+
  theme(legend.position="top",
        axis.text.x = element_text(angle = 90, hjust = 1))+
  labs(title = "Confidence Vs sizes",
       y = "Confidence", x = "Sizes")

ggplotly(g)

The Output

Let’s print the text with the highest mean confidence

data.points.max <- data.points %>% filter(y == max(y))
data.points.max
  index    x        y                                     text
1    28 2800 87.47066 EI 2020 lo descargaron\ncon Ares!\n6 »\n
writeLines( as.character(data.points.max$text) )
EI 2020 lo descargaron
con Ares!
6 »
# free up some memory
rm(x, y, index, text, ocr.data, sizes, g, data.points.max, image_file, image_original, image_magick, tesseract_engine )
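
Selecting the winner with `y == max(y)` assumes that every size produced a valid score. As a hedged alternative, the sketch below uses base R's `which.max()`, which skips missing scores and returns a single row even when several sizes tie. The table `results` is hypothetical and only mimics the columns of `data.points`.

```r
# Hypothetical results table with the same kind of columns as data.points.
results <- data.frame(
  x    = c(100, 200, 300, 400),
  y    = c(NaN, 80.5, 87.4, 87.4),
  text = c("", "EI 2020", "EI 2020 lo", "EI 2020 lo descargaron"),
  stringsAsFactors = FALSE
)

# A single NaN score makes the maximum NaN, so y == max(y) would match nothing.
max(results$y)

# which.max() discards NaN/NA and returns the index of the first maximum.
best <- results[which.max(results$y), ]
best$x
```

Taking the first maximum favours the smallest image among tied sizes, which is a reasonable default because smaller images are faster to process.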

Final Words