I want to reorganize and digitize the LSE’s Charles Booth Notebooks.
I need to import sections of the notebooks as JPGs; some pages also contain sketches of maps. I will use other methods for other media.
I will try to read the text with the tesseract package, but the messy handwriting may cause trouble.
Load Packages:
library(tesseract)
## Warning: package 'tesseract' was built under R version 3.6.3
library(magick)
## Warning: package 'magick' was built under R version 3.6.2
## Linking to ImageMagick 6.9.9.14
## Enabled features: cairo, freetype, fftw, ghostscript, lcms, pango, rsvg, webp
## Disabled features: fontconfig, x11
Import the image of the text:
NoteBook1Pg2 <- image_read(path = "C:/Users/Owner/Pictures/BoothBookPg1And2.jpg")
View the image:
plot(NoteBook1Pg2)
I want to isolate page 2, the page to the right of the book’s spine.
NoteBook1Pg2
croppedImage <- image_crop(NoteBook1Pg2, geometry = geometry_area(x_off = 870))
plot(croppedImage)
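The offset of 870 pixels above was chosen for this particular scan. As a rough sketch, the offset could instead be derived from the image width, which may help when other spreads are scanned at different sizes. The helper name `crop_right_page` and the `spine_frac` parameter are hypothetical, and the code assumes the spine sits at a fixed fraction of the scan width, which may not hold for every photograph.

```r
library(magick)

# Hypothetical helper: keep the right-hand page of a two-page scan.
# Assumes the spine lies at `spine_frac` of the total width (0.5 = centre).
crop_right_page <- function(img, spine_frac = 0.5) {
  info  <- image_info(img)
  x_off <- round(info$width[1] * spine_frac)
  image_crop(
    img,
    geometry = geometry_area(
      width  = info$width[1] - x_off,  # everything right of the spine
      height = info$height[1],         # full page height
      x_off  = x_off
    )
  )
}
```

With the objects above this would be called as `croppedImage <- crop_right_page(NoteBook1Pg2)`, adjusting `spine_frac` if the spine is off-centre.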
I have page 2. Now I want to read its text.
Clean and enhance the image:
myImage <- image_convert(image = croppedImage, type = "grayscale")
plot(myImage)
myImage2 <- image_trim(image = myImage, fuzz = 30)
plot(myImage2)
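The cleaning steps above could be bundled into one function and extended with upscaling and contrast normalization, which sometimes help OCR on faint scans. This is only a sketch: the helper name `preprocess_for_ocr` and its default values are assumptions, and there is no guarantee these steps improve results on handwriting.

```r
library(magick)

# Hypothetical helper: a small preprocessing chain before OCR.
# The defaults are illustrative guesses, not tuned values.
preprocess_for_ocr <- function(img, target_width = 2000, fuzz = 30) {
  img <- image_resize(img, paste0(target_width, "x"))  # scale to target width, keep aspect ratio
  img <- image_convert(img, type = "Grayscale")        # drop colour information
  img <- image_normalize(img)                          # stretch contrast to the full range
  image_trim(img, fuzz = fuzz)                         # remove near-uniform borders
}
```

For the page above this would be `myImage2 <- preprocess_for_ocr(croppedImage)`.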
Now attempt to read the text:
MyText <- ocr(image = myImage2, engine = tesseract())
cat(MyText)
## f [Shek Hatha mprarel nS"
## ik II l e neler ec th) >
## eet wer
## | Rrambrtatn tht Sur Giles Cre Pile “gale
## ee He ee ec . H, D2
## t. Ath, Wel the Incl krel
## Y Safer iit
## Bele Wak Fie, Rod’:
## ' Wet fy Nennaher. tar RA
## | rh eA
## Satin. . ba ckaok.. 4
## | he 6
## | Fh. Dah Pal
## ficuny Int ton KeWorbteck ©)
## als thgh fiat poplar te eel
## frdice Colt in Ke Uret Sua dnl
## | ee ihe. fh isl,
## | tt dre, Arka , etetrkd Mey
## pet! of [oper
## \ Sedo oe re bebe Hi Behe Re on ctr
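The output is mostly unreadable, as expected for handwriting. One way to quantify this is `ocr_data()` from the tesseract package, which returns one row per recognized word together with a confidence score. The sketch below lists the words tesseract is least sure about; the helper name `low_confidence_words` and the threshold of 60 are assumptions chosen for illustration.

```r
library(magick)
library(tesseract)

# Hypothetical helper: list words whose OCR confidence falls below a threshold.
low_confidence_words <- function(img, threshold = 60, engine = tesseract("eng")) {
  words <- ocr_data(img, engine = engine)  # one row per word: word, confidence, bbox
  words[words$confidence < threshold, c("word", "confidence")]
}
```

Calling `low_confidence_words(myImage2)` on the page above would show how much of the transcription tesseract itself considers unreliable.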