Processing Image Data to Text With Tesseract
First, install tesseract, then load the packages.
library(tesseract)
library(magick)
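If either package is not installed yet, the `library()` calls above will fail. A minimal sketch for checking this first is shown below; the helper name `missing_packages` is illustrative, and the install line is left commented out so nothing is installed without intent.

```r
# Return the subset of `pkgs` that cannot be loaded from the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tesseract", "magick"))
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

On Linux, the system Tesseract and ImageMagick libraries may also need to be present before the R packages will build.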
Citations:
citation("tesseract")
To cite package ‘tesseract’ in publications use:
Jeroen Ooms (2019). tesseract: Open Source OCR Engine. R package version 4.1.
https://CRAN.R-project.org/package=tesseract
A BibTeX entry for LaTeX users is
@Manual{,
title = {tesseract: Open Source OCR Engine},
author = {Jeroen Ooms},
year = {2019},
note = {R package version 4.1},
url = {https://CRAN.R-project.org/package=tesseract},
}
citation("magick")
To cite package ‘magick’ in publications use:
Jeroen Ooms (2020). magick: Advanced Graphics and Image-Processing in R. R package
version 2.3. https://CRAN.R-project.org/package=magick
A BibTeX entry for LaTeX users is
@Manual{,
title = {magick: Advanced Graphics and Image-Processing in R},
author = {Jeroen Ooms},
year = {2020},
note = {R package version 2.3},
url = {https://CRAN.R-project.org/package=magick},
}
Read the image in with image_read():
input <- image_read("C:/Users/Owner/Downloads/IMG_20200423_165352__01__01.jpg")
input
plot(input)
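Format and process the image, then convert it to text:
myText <- input %>%
  image_resize("2000x") %>%                              # scale to 2000 px wide, keeping the aspect ratio
  image_convert(type = 'Grayscale') %>%                  # drop color information
  image_trim(fuzz = 40) %>%                              # trim the near-uniform border around the table
  image_write(format = 'png', density = '300x300') %>%   # encode as PNG in memory (raw vector)
  tesseract::ocr()                                       # run OCR on the encoded image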
cat(myText)
pes BABYR MOMAGE WEDYR WBYRS MOWEDAGE NOPREG HISTPREG
2 84 38 69 15.6 23 6 2.6
7 85 37 67 17.4 20 3 5.8
10 85 37 68 16.7 21 6 2.8
44 85 34 70 14.8 20 5 3.0
66 85 31 74 11.6 20 0 0.0
73 85 35 72 13.4 22 7 1,9
76 85 37 70 15.4 22 9 447
78 85 43 64 21.3 22 11 1,9
80 85 36 69 15.7 21 10 1.6
87 85 a 70 15.2 22 3 3.4
91 85 33 73 12.0 21 3 4.0
| 102 85 38 68 LEO 21 12 1.4
103 85 36 70 15.4 21 9 L7
105 86 37 70 1535 22 7 2.2
107 86 39 66 19:57 20 12 1.6
111 86 31 75 10.6 21 10 r,t
116 86 34 73 13.2 21 5 BiG
117 86 37 70 15.6 22 8 2.0
121 86 41 67 18.3 23 10 1.8
134 86 38 70 16.6 22 6 2.8
139 86 38 70 15.8 23 9 1.8
146 86 39 70 15.6 24 9 1%
149 86 40 68 18.0 22 8 2.3
150 86 39 66 19.8 20 9 2.2
154 86 36 74 12.0 24 5 2.4
165 86 45 63 22st 23 10 2.3
170 86 34 72 14.5 20 9 1.6
184 86 40 70 16.1 24 10 £6
194 86 40 67 19).7 21 11 1.8
212 86 35 72 13.8 22 9 5
215 87 36 73 14.1 22 6 2.4
226 87 39 69 Lol 22 10 1:7
232 87 37 72 14.8 23 8 r9
238 87 36 70 16.4 20 10 1.6
243 87 34 1D 12.0 22 11 1
247 87 34 73 1339 21 4 3.5"
248 87 31 76 10.5 21 5 2.1
256 87 32 75 11.8 21 8 5
266 87 33 75 11.9 22 9 Ar 3
268 87 35 73 13.6 22 6 28
272 87 41 69 18.0 23 8 2.3
274 87 36 71 16.5 20 2 8.3
288 87 31 77 10.1 » a 5 2.0
290 87 35 73 14.1 21 6 2.4
294 87 31 75 11.9 20 6 2.0
308 87 30 77 10.5 20 5 21
318 87 42 69 18.4 24 11 fled
332 87 38 70 16.4 22 7 2.3
339 88 34 75 12.8 22 7 1.8
341 88 34 76 11.9 23 ik. Lot
342 88 33 i 12.9 21 5 2.6
344 88 33 76 11.0 22 3 S751
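Several of the misreads above are letters or symbols inside numeric columns (for example `LEO`, `BiG`, `fled`). One option, offered as a sketch and not what was done here, is to restrict the engine to the characters a numeric table can contain, using the `tessedit_char_whitelist` option of `tesseract::tesseract()`. The helper name `ocr_digits_only` is hypothetical, the header row would be lost because letters are excluded, and whether the whitelist is honored depends on the installed Tesseract version.

```r
# Hypothetical helper: OCR an image while allowing only digits and the decimal point.
# `image` can be a file path or a raw vector such as the one returned by image_write().
ocr_digits_only <- function(image) {
  engine <- tesseract::tesseract(
    language = "eng",
    options = list(tessedit_char_whitelist = "0123456789.")
  )
  tesseract::ocr(image, engine = engine)
}

# Possible usage with the image loaded earlier (not run here):
# digitsText <- input %>%
#   image_resize("2000x") %>%
#   image_write(format = 'png') %>%
#   ocr_digits_only()
```

A whitelist cannot fix a digit misread as another digit or a dropped decimal point, so the result still needs checking against the paper original.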
Save the text:
write.table(myText, file = "convertedTExt1.txt", sep = " ", row.names = TRUE, col.names = NA)
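Before cleaning the file by hand, it can help to list which lines need attention. The sketch below is one assumed way to do that in base R, not the procedure used here: it splits the OCR text into lines and flags every data row that does not consist of exactly eight numeric fields. The function name `flag_ocr_rows` is illustrative, and the sample rows are copied from the OCR output above.

```r
# Flag data rows that do not have exactly `n_fields` plain numeric fields.
flag_ocr_rows <- function(text, n_fields = 8) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(trimws(lines))]          # drop blank lines
  rows <- lines[-1]                              # first line is the header
  fields <- strsplit(trimws(rows), "[[:space:]]+")
  ok <- vapply(fields, function(f) {
    length(f) == n_fields && all(grepl("^[0-9]+(\\.[0-9]+)?$", f))
  }, logical(1))
  data.frame(row = seq_along(rows), text = rows, ok = ok,
             stringsAsFactors = FALSE)
}

sample_text <- paste(
  "pes BABYR MOMAGE WEDYR WBYRS MOWEDAGE NOPREG HISTPREG",
  "2 84 38 69 15.6 23 6 2.6",
  "73 85 35 72 13.4 22 7 1,9",
  "87 85 a 70 15.2 22 3 3.4",
  "| 102 85 38 68 LEO 21 12 1.4",
  sep = "\n"
)

flagged <- flag_ocr_rows(sample_text)
print(flagged[!flagged$ok, ])   # rows that need manual correction
```

This only catches fields that are not numbers or rows with the wrong field count. A value such as `1535` read in place of a decimal still passes, so it does not replace comparing against the paper table.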
I opened and cleaned the text file, then saved it as a CSV. Now I will try to read in that CSV.
#EEMH_JSN_10PlusYrsMrd <- read.csv("~/OilMartetAnalysis/EEMH_JSN_10PlusYrsMrd.csv", sep="")
EEMH_JSN_10PlusYrsMrd <- read.csv("~/OilMartetAnalysis/convertedTExt1.CSV", sep="")
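With `sep = ""`, `read.csv()` splits on any run of whitespace, and a column that still contains a stray character is read as text instead of numbers. A quick check, sketched here with an inline sample instead of the real file, is to list the columns that did not come through as numeric. The names `sample_csv` and `df_check` are illustrative; the rows come from the OCR output, with the header written as `OBS` to match the variable list.

```r
# Inline stand-in for the cleaned file; the second data row still holds a misread "a".
sample_csv <- "OBS BABYR MOMAGE WEDYR WBYRS MOWEDAGE NOPREG HISTPREG
2 84 38 69 15.6 23 6 2.6
87 85 a 70 15.2 22 3 3.4
91 85 33 73 12.0 21 3 4.0"

df_check <- read.csv(text = sample_csv, sep = "")

# Columns that were not parsed as numbers point at leftover OCR errors.
non_numeric <- names(df_check)[!vapply(df_check, is.numeric, logical(1))]
print(non_numeric)
```

An empty result suggests every column parsed as numeric; any name listed is a column worth reopening in the text file.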
SUCCESS!!!!!!!
I HAVE A DATA FRAME FROM PAPER!
EEMH_JSN_10PlusYrsMrd
Save the data for export, because something is wrong with the base CSV file:
write.csv(x = EEMH_JSN_10PlusYrsMrd, file = "EEMH_JSN_10PlusYrsMrd.csv")
Using the tesseract and magick packages, I was able to create a data frame from a table on paper.
The data represent the cohort married for 10 or more years.
The variables are:
OBS:
#UNKNOWN#
BABYR:
EEMH delivery year.
MOMAGE:
Mother’s age.
WEDYR:
#UNKNOWN#
WBYRS:
Years married at birth.
MOWEDAGE:
Mother’s age at wedding.
NOPREG:
Number of pregnancies.
HISTPREG:
Years married per pregnancy (WBYRS / NOPREG).
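In most rows of the OCR output, HISTPREG equals WBYRS divided by NOPREG, rounded to one decimal (15.6 / 6 = 2.6, 17.4 / 3 = 5.8), with 0.0 where NOPREG is 0. Assuming that relationship holds throughout, which is inferred from the data and not stated by the source, it offers a way to catch misreads that still look numeric, such as a dropped decimal point. The function name `check_histpreg` and the tolerance are illustrative.

```r
# Flag rows where HISTPREG disagrees with WBYRS / NOPREG by more than rounding error.
check_histpreg <- function(df, tol = 0.06) {
  expected <- ifelse(df$NOPREG == 0, 0, df$WBYRS / df$NOPREG)
  df$expected <- round(expected, 1)
  df$suspect <- abs(df$HISTPREG - expected) > tol
  df
}

# Values copied from the OCR output; row 308 was read as 21 by the OCR.
sample_df <- data.frame(
  OBS      = c(2, 7, 66, 308),
  WBYRS    = c(15.6, 17.4, 11.6, 10.5),
  NOPREG   = c(6, 3, 0, 5),
  HISTPREG = c(2.6, 5.8, 0.0, 21)
)

checked <- check_histpreg(sample_df)
print(checked[checked$suspect, ])   # rows to compare against the paper table
```

A flagged row does not say which of the three fields was misread, only that they disagree, so the paper table remains the reference.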