Image Processing using R Programming

Muhammad Khan

2020-08-15

Getting Data (In TEXT Form) From Images Using R

Introduction

Collecting data from different sources is a difficult task in data science. A great deal of data exists only in image form, and much of it is useful for analysis. This vignette shows how to extract that data as text from an image stored on the local disk, from an image hosted on a website, and from a PDF. The examples use an image of a page from the book “Intro to R” and the Tesseract OCR engine in R. OCR (optical character recognition) scans an image and recognizes the text in it.

This picture has some example text:

Loading and installing Packages

install.packages("tesseract")


library(tesseract)

Getting data from locally stored image

Now it’s time to play.

First, we create an OCR engine for English and assign the text extracted from the image to a variable.

eng <- tesseract("eng")
text <- tesseract::ocr("C:/Users/Khan/Desktop/AT1/Page 1.png", engine = eng)

Once the text is extracted, we display it with the code below.

#OUTPUT

cat(text)
#> 1 Introduction and preliminaries
#> 1.1 The R environment
#> R is an integrated suite of software facilities for data manipulation, calculation and graphical
#> display. Among other things it has.
#> © an effective data handling and storage facility,
#> © asuite of operators for calculations on arrays, in particular matrices,
#> a large, coherent, integrated collection of intermediate tools for data analysis,
#> © graphical facilities for data analysis and display either directly at the computer or on hard-
#> copy, and
#> * a well developed, simple and effective programming language (called ‘S’) which includes
#> conditionals, loops, user defined recursive functions and input and output facilities. (Indeed
#> most of the system supplied functions are themselves written in the $ language.)
#> 
#> ‘The term “environment” is intended to characterize it as a fully planned and coherent system,
#> rather than an incremental accretion of very specific and inflexible tools, as is frequently the
#> case with other data analysis software.
#> 
#> R is very much a vehicle for newly developing methods of interactive data analysis. It has
#> developed rapidly, and has been extended by a large collection of packages. However, most
#> programs written in R are essentially ephemeral, written for a single piece of data analysis.
#> 1.2 Related software and documentation
#> R can be regarded as an implementation of the S language which was developed at Bell Labora-
#> tories by Rick Becker, John Chambers and Allan Wilks, and also forms the basis of the S-PLus
#> systems.
#> 
#> ‘The evolution of the S language is characterized by four books by John Chambers and
#> coauthors. For R, the basic reference is The New S Language: A Programming Environment
#> ‘for Data Analysis and Graphics by Richard A. Becker, John M. Chambers and Allan R. Wilks.
#> The new features of the 1991 release of S are covered in Statistical Models in S edited by John
#> M. Chambers and Trevor J. Hastie. The formal methods and classes of the methods package are
#> based on those described in Programming with Data by John M. Chambers. See Appendix F
#> [References], page 99, for precise references.
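
Because the aim is to end up with the extracted text in a text file, the string returned by ocr() can be written to disk with base R. The block below is only a minimal sketch: the helper name save_ocr_text() and the output location are hypothetical, and a short stand-in string replaces the real OCR result so that the example runs on its own.

```r
# Hypothetical helper: write OCR text (a character vector) to a plain text file.
save_ocr_text <- function(text, path) {
  # OCR text carries its own line breaks, so no extra separator is added.
  writeLines(text, con = path, sep = "")
  invisible(path)
}

# Stand-in for the string returned by tesseract::ocr(); in practice pass `text`.
sample_text <- "1 Introduction and preliminaries\n1.1 The R environment\n"

# Assumed output location; any writable path works.
out_file <- file.path(tempdir(), "page1.txt")
save_ocr_text(sample_text, out_file)

# Read the file back to check what was written.
readLines(out_file)
```

With the real result, the call would be save_ocr_text(text, out_file), where text is the object created above.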

Some extra information

If we also want the confidence score and the bounding box of each word in the image, the ocr_data() function returns them.

The code is as follows.


results <- tesseract::ocr_data("C:/Users/Khan/Desktop/AT1/Page 1.png", engine = eng)

The result is shown below.

#OUTPUT

results
#>                  word confidence            bbox
#> 1                   1   94.10359   100,15,111,33
#> 2        Introduction   94.10359   128,15,290,33
#> 3                 and   96.41945   301,15,347,33
#> 4       preliminaries   96.18529   358,15,524,38
#> 5                 1.1   95.13088   100,83,128,98
#> 6                 The   93.29645   142,83,183,98
#> 7                   R   91.15434   192,83,210,98
#> 8         environment   95.89487   219,83,352,98
#> 9                   R   90.41470  88,112,104,136
#> 10                 is   96.39900  98,116,128,128
#> 11                 an   96.94665 135,120,152,128
#> 12         integrated   96.87312 159,116,232,131
#> 13              suite   96.97060 239,117,273,128
#> 14                 of   95.92580 280,116,294,128
#> 15           software   95.92580 300,116,359,128
#> 16         facilities   96.82002 366,116,426,128
#> 17                for   96.93781 433,116,452,128
#> 18               data   96.84337 459,116,492,128
#> 19      manipulation,   96.12000 498,116,598,131
#> 20        calculation   96.44418 606,116,684,128
#> 21                and   96.94547 691,116,717,128
#> 22          graphical   96.57837 724,116,791,131
#> 23           display.   96.30447  98,136,151,151
#> 24              Among   96.60017 160,136,212,151
#> 25              other   96.60017 218,136,255,148
#> 26             things   95.09252 261,136,305,151
#> 27                 it   96.90296 312,137,322,148
#> 28               has.   61.10096 328,136,352,148
#> 29                 ©   53.35981 112,165,119,172
#> 30                 an   24.82700 131,165,149,173
#> 31          effective   24.82700 154,161,213,173
#> 32               data   96.35302 219,161,251,173
#> 33           handling   96.01633 256,161,320,176
#> 34                and   96.99602 325,161,352,173
#> 35            storage   96.97532 357,163,409,176
#> 36          facility,   95.41611 415,161,467,176
#> 37                 ©   72.28120 112,190,119,197
#> 38             asuite   59.38714 131,187,179,198
#> 39                 of   96.93511 185,186,199,198
#> 40          operators   94.93613 204,188,272,201
#> 41                for   96.91566 278,186,297,198
#> 42       calculations   96.52139 303,186,388,198
#> 43                 on   96.95818 394,190,412,198
#> 44            arrays,   91.45399 417,190,465,201
#> 45                 in   94.86411 472,187,486,198
#> 46         particular   94.86411 491,186,563,201
#> 47          matrices,   96.88867 568,187,633,201
#> 48                  a   71.83182 131,215,140,223
#> 49             large,   71.83182 145,211,184,226
#> 50          coherent,   96.26818 190,211,256,226
#> 51         integrated   92.38438 262,211,336,226
#> 52         collection   94.56461 341,211,410,223
#> 53                 of   96.61872 415,211,430,223
#> 54       intermediate   93.76604 435,211,526,223
#> 55              tools   95.28380 531,211,566,223
#> 56                for   96.57478 572,211,591,223
#> 57               data   96.36632 597,211,630,223
#> 58          analysis,   93.64006 635,211,696,226
#> 59                 ©   76.32910 112,240,119,247
#> 60          graphical   72.48493 131,236,198,251
#> 61         facilities   96.73284 203,236,262,248
#> 62                for   96.65889 268,236,287,248
#> 63               data   96.94214 293,236,325,248
#> 64           analysis   96.82896 331,236,387,251
#> 65                and   96.72476 393,236,419,248
#> 66            display   96.72476 425,236,475,251
#> 67             either   95.53992 481,236,522,248
#> 68           directly   96.81799 528,236,582,251
#> 69                 at   96.97424 588,238,602,248
#> 70                the   96.94759 607,236,631,248
#> 71           computer   96.63431 636,238,704,251
#> 72                 or   97.00106 709,240,724,248
#> 73                 on   93.29789 729,240,747,248
#> 74              hard-   89.66180 752,236,790,248
#> 75              copy,   88.66334 131,261,167,272
#> 76                and   96.92912 173,257,200,269
#> 77                  *   24.40579 112,286,119,293
#> 78                  a   95.08904 131,286,140,294
#> 79               well   95.67131 147,282,175,294
#> 80         developed,   96.86376 182,282,257,297
#> 81             simple   96.86035 266,282,312,297
#> 82                and   96.83335 320,282,346,294
#> 83          effective   95.53581 354,282,413,294
#> 84        programming   95.27283 420,283,517,297
#> 85           language   95.38758 524,282,588,297
#> 86            (called   93.19312 596,282,643,298
#> 87               ‘S’)   90.94704 652,282,675,298
#> 88              which   96.91003 683,282,725,294
#> 89           includes   92.47908 733,282,791,294
#> 90      conditionals,   92.97797 131,302,222,317
#> 91             loops,   87.84768 227,302,269,317
#> 92               user   92.38862 274,306,304,314
#> 93            defined   92.38862 309,302,360,314
#> 94          recursive   92.64096 365,303,429,314
#> 95          functions   93.85895 434,302,500,314
#> 96                and   96.98881 505,302,531,314
#> 97              input   96.89696 536,303,574,317
#> 98                and   96.94315 580,302,606,314
#> 99             output   96.94315 611,304,659,317
#> 100       facilities.   96.47317 665,302,727,314
#> 101           (Indeed   96.67127 737,302,791,318
#> 102              most   92.79594 131,324,166,334
#> 103                of   95.28929 172,322,186,334
#> 104               the   96.92352 191,322,214,334
#> 105            system   96.83994 220,324,269,337
#> 106          supplied   96.65384 275,322,335,337
#> 107         functions   96.90672 341,322,407,334
#> 108               are   97.00060 413,326,435,334
#> 109        themselves   95.61484 440,322,518,334
#> 110           written   94.38512 524,323,577,334
#> 111                in   93.84140 583,323,596,334
#> 112               the   93.29305 602,322,625,334
#> 113                 $   55.47869 631,322,639,334
#> 114        language.)   94.32716 645,322,719,338
#> 115              ‘The   87.50699 121,352,149,364
#> 116              term   96.86191 154,354,188,364
#> 117     “environment”   95.11521 194,352,297,364
#> 118                is   96.80374 304,353,314,364
#> 119          intended   94.98958 319,352,381,364
#> 120                to   96.96252 385,354,400,364
#> 121      characterize   96.88525 404,352,491,364
#> 122                it   94.29339 495,353,506,364
#> 123                as   93.99160 511,356,525,364
#> 124                 a   93.99160 530,356,538,364
#> 125             fully   94.22041 542,352,574,367
#> 126           planned   93.04081 578,352,635,367
#> 127               and   96.13908 640,352,666,364
#> 128          coherent   96.13908 671,352,732,364
#> 129           system,   96.80345 737,354,790,367
#> 130            rather   96.83542  98,372,142,384
#> 131              than   94.63019 149,372,183,384
#> 132                an   94.63019 190,376,207,384
#> 133       incremental   96.75865 214,372,299,384
#> 134         accretion   96.79461 306,373,371,384
#> 135                of   96.96443 378,372,393,384
#> 136              very   96.63090 398,376,429,387
#> 137          specific   96.14211 436,372,489,387
#> 138               and   96.86388 496,372,522,384
#> 139        inflexible   96.53220 530,372,594,384
#> 140            tools,   95.67990 601,372,640,387
#> 141                as   95.67990 648,376,663,384
#> 142                is   96.10239 670,373,681,384
#> 143        frequently   96.68401 688,372,761,387
#> 144               the   96.93954 768,372,791,384
#> 145              case   91.15045  98,396,128,404
#> 146              with   96.98829 133,392,166,404
#> 147             other   96.29401 171,392,209,404
#> 148              data   96.92236 215,392,248,404
#> 149          analysis   96.94396 253,392,310,407
#> 150         software.   96.25410 315,392,379,404
#> 151                 R   92.60973 112,408,127,437
#> 152                is   95.85543 121,418,150,430
#> 153              very   96.88503 157,422,188,433
#> 154              much   96.61425 194,418,233,430
#> 155                 a   96.57330 240,422,248,430
#> 156           vehicle   96.57330 254,418,303,430
#> 157               for   97.01272 310,418,329,430
#> 158             newly   96.88533 336,418,378,433
#> 159        developing   96.83430 385,418,461,433
#> 160           methods   94.10760 467,418,529,430
#> 161                of   97.01233 535,418,550,430
#> 162       interactive   96.60221 555,419,631,430
#> 163              data   96.85853 638,418,671,430
#> 164         analysis.   92.59615 677,418,737,433
#> 165                It   93.76827 748,418,760,430
#> 166               has   96.82083 767,418,791,430
#> 167         developed   96.59850  98,438,169,453
#> 168          rapidly,   95.24947 177,438,231,453
#> 169               and   95.24947 240,438,267,450
#> 170               has   96.95512 274,438,298,450
#> 171              been   95.76255 306,438,340,450
#> 172          extended   95.76255 347,438,412,450
#> 173                by   96.30648 420,438,438,453
#> 174                 a   92.66110 434,428,450,457
#> 175             large   92.66110 461,438,496,453
#> 176        collection   95.84861 504,438,573,450
#> 177                of   95.84861 580,438,595,450
#> 178         packages.   93.21439 601,438,667,453
#> 179          However,   95.71909 682,438,747,453
#> 180              most   96.91972 756,440,790,450
#> 181          programs   96.77380  98,462,166,473
#> 182           written   96.64845 171,459,224,470
#> 183                in   93.18479 230,459,244,470
#> 184                 R   92.39890 249,458,262,470
#> 185               are   96.96569 267,462,289,470
#> 186       essentially   96.56873 295,458,369,473
#> 187        ephemeral,   94.98083 375,458,452,473
#> 188           written   95.92316 459,459,512,470
#> 189               for   96.70023 518,458,537,470
#> 190                 a   94.26644 543,462,551,470
#> 191            single   94.26644 557,458,597,473
#> 192             piece   95.31142 603,459,639,473
#> 193                of   97.01031 645,458,659,470
#> 194              data   96.95101 664,458,696,470
#> 195         analysis.   95.98315 702,458,762,473
#> 196               1.2   95.52615 100,501,129,516
#> 197           Related   96.05885 142,501,224,516
#> 198          software   96.33711 233,501,321,516
#> 199               and   96.23862 331,501,369,516
#> 200     documentation   96.15762 378,501,538,516
#> 201                 R   92.27918  92,529,104,553
#> 202               can   96.73250 115,537,140,545
#> 203                be   96.88934 145,533,162,545
#> 204          regarded   96.90101 167,533,230,548
#> 205                as   96.88451 236,537,250,545
#> 206                an   96.90987 255,537,273,545
#> 207    implementation   96.74846 278,533,391,548
#> 208                of   96.82339 396,533,411,545
#> 209               the   93.30180 414,533,437,545
#> 210                 S   84.05531 443,533,451,545
#> 211          language   96.85934 457,533,520,548
#> 212             which   96.96651 525,533,568,545
#> 213               was   96.67907 572,537,599,545
#> 214         developed   96.77591 604,533,675,548
#> 215                at   95.54111 681,535,695,545
#> 216              Bell   93.29832 701,533,729,545
#> 217           Labora-   83.46745 734,533,790,545
#> 218            tories   95.99136  98,554,138,565
#> 219                by   96.98070 143,553,161,568
#> 220              Rick   96.58904 166,553,198,565
#> 221           Becker,   96.79912 204,553,256,568
#> 222              John   96.79912 262,553,297,565
#> 223          Chambers   96.91299 303,553,376,565
#> 224               and   96.79562 381,553,408,565
#> 225             Allan   96.39108 413,553,452,565
#> 226            Wilks,   95.32522 458,553,503,568
#> 227               and   95.32522 510,553,536,565
#> 228              also   95.89719 542,553,569,565
#> 229             forms   95.89719 575,553,614,565
#> 230               the   96.93426 620,553,643,565
#> 231             basis   96.83044 648,553,683,565
#> 232                of   96.42748 689,553,703,565
#> 233               the   93.20888 707,553,730,565
#> 234            S-PLus   89.91541 736,553,791,565
#> 235          systems.   94.17735  98,575,158,588
#> 236              ‘The   74.13974 121,599,149,611
#> 237         evolution   96.94698 158,599,225,611
#> 238                of   95.99128 233,599,247,611
#> 239               the   93.20117 254,599,277,611
#> 240                 S   58.09464 286,599,294,611
#> 241          language   95.40086 303,599,367,614
#> 242                is   96.73465 376,600,386,611
#> 243     characterized   96.46387 395,599,491,611
#> 244                by   96.78749 499,599,517,614
#> 245              four   96.17101 525,599,554,611
#> 246             books   96.17101 562,599,604,611
#> 247                by   96.98940 613,599,630,614
#> 248              John   96.72932 639,599,674,611
#> 249          Chambers   96.75008 683,599,756,611
#> 250               and   96.90816 764,599,791,611
#> 251        coauthors.   95.77006  98,619,172,631
#> 252               For   93.30474 184,619,208,631
#> 253                R,   92.87585 216,619,231,634
#> 254               the   97.01650 239,619,262,631
#> 255             basic   96.59161 269,619,305,631
#> 256         reference   96.79369 312,619,376,631
#> 257                is   96.89050 383,620,394,631
#> 258               The   96.85065 404,619,429,631
#> 259               New   93.30556 437,619,468,631
#> 260                 S   86.50048 475,619,485,631
#> 261         Language:   96.52365 492,619,564,634
#> 262                 A   96.52365 574,619,585,631
#> 263       Programming   96.37775 594,619,691,634
#> 264       Environment   96.07622 698,619,792,631
#> 265              ‘for   75.27953  97,639,120,654
#> 266              Data   96.45210 125,639,160,651
#> 267          Analysis   96.72615 166,639,227,654
#> 268               and   95.12789 233,639,259,651
#> 269          Graphics   95.12789 266,639,329,654
#> 270                by   95.85399 335,639,352,654
#> 271           Richard   95.85399 358,639,415,651
#> 272                A.   93.71674 421,639,436,651
#> 273           Becker,   96.72398 443,639,495,654
#> 274              John   93.19505 501,639,537,651
#> 275                M.   91.37506 542,639,560,651
#> 276          Chambers   96.55910 567,639,640,651
#> 277               and   95.97836 646,639,672,651
#> 278             Allan   93.30605 678,639,717,651
#> 279                R.   86.27783 723,639,738,651
#> 280            Wilks.   96.57822 745,639,790,651
#> 281               The   92.48247  98,659,127,671
#> 282               new   96.25520 132,663,161,671
#> 283          features   96.79118 167,659,224,671
#> 284                of   96.90687 229,659,244,671
#> 285               the   96.93149 249,659,272,671
#> 286              1991   94.89002 279,660,309,671
#> 287           release   96.99317 316,659,365,671
#> 288                of   92.90592 370,659,385,671
#> 289                 S   52.64245 390,659,398,671
#> 290               are   96.98556 405,663,426,671
#> 291           covered   96.95816 432,659,486,671
#> 292                in   96.90914 492,660,506,671
#> 293       Statistical   96.77760 513,659,584,671
#> 294            Models   96.98988 590,659,639,671
#> 295                in   93.03336 646,660,660,671
#> 296                 S   81.27194 666,659,676,671
#> 297            edited   96.82858 682,659,726,671
#> 298                by   96.64964 732,659,750,674
#> 299              John   96.64964 756,659,791,671
#> 300                M.   67.33022  98,679,117,691
#> 301          Chambers   95.41583 123,679,196,691
#> 302               and   96.91615 201,679,227,691
#> 303            Trevor   96.93314 232,679,280,691
#> 304                J.   96.88129 285,679,296,691
#> 305           Hastie.   96.62265 303,679,352,691
#> 306               The   96.16291 360,679,389,691
#> 307            formal   96.16291 394,679,440,691
#> 308           methods   96.61920 445,679,506,691
#> 309               and   96.65321 512,679,538,691
#> 310           classes   94.63172 543,679,590,691
#> 311                of   96.99500 595,679,610,691
#> 312               the   96.55423 613,679,636,691
#> 313           methods   96.90505 641,679,703,691
#> 314           package   96.50262 707,679,764,694
#> 315               are   96.83183 769,683,791,691
#> 316             based   93.86314  98,699,139,711
#> 317                on   94.85415 145,703,163,711
#> 318             those   96.22873 169,699,207,711
#> 319         described   96.57688 214,699,282,711
#> 320                in   96.70934 289,700,302,711
#> 321       Programming   96.39059 310,699,407,714
#> 322              with   96.96989 413,699,443,711
#> 323              Data   96.97105 450,699,485,711
#> 324                by   96.99678 492,699,510,714
#> 325              John   93.18324 517,699,552,711
#> 326                M.   92.28197 558,699,577,711
#> 327         Chambers.   96.58822 585,699,661,711
#> 328               See   96.63461 673,699,696,711
#> 329          Appendix   91.32970 703,699,773,714
#> 330                 F   85.94184 780,699,791,711
#> 331     [References],   92.53856  99,718,188,735
#> 332              page   96.70426 194,723,227,734
#> 333               99,   93.99168 233,720,253,734
#> 334               for   96.77223 260,719,280,731
#> 335           precise   94.57893 285,720,335,734
#> 336       references.   91.03867 340,719,415,731
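
The confidence column makes it easy to pick out the words the engine was unsure about, which are usually the ones worth checking by hand. The following is a minimal sketch, not part of the original analysis: the helper low_confidence_words() and the threshold of 80 are assumptions chosen for illustration, and the small data frame is rebuilt by hand from a few rows of the output above so that the block runs without the image.

```r
# Hypothetical helper: keep the words whose confidence is below a threshold,
# least certain first.
low_confidence_words <- function(results, threshold = 80) {
  flagged <- results[results$confidence < threshold, , drop = FALSE]
  flagged[order(flagged$confidence), , drop = FALSE]
}

# Rows 26-28 and 30-32 of the output above, typed in by hand;
# in practice pass the `results` object from ocr_data() directly.
sample_results <- data.frame(
  word = c("things", "it", "has.", "an", "effective", "data"),
  confidence = c(95.09252, 96.90296, 61.10096, 24.82700, 24.82700, 96.35302),
  bbox = c("261,136,305,151", "312,137,322,148", "328,136,352,148",
           "131,165,149,173", "154,161,213,173", "219,161,251,173"),
  stringsAsFactors = FALSE
)

low_confidence_words(sample_results, threshold = 80)
```

In this sample the flagged words sit next to the bullet symbol that the engine misread, which suggests the low scores come from the bullet rather than from the words themselves.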

Getting text data from an image stored online (image taken from GitHub)

An image hosted online is read the same way as a local one: the ocr() function accepts a URL in place of the local file path.
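
A minimal sketch of reading an online image, under the assumption that the image is reachable over http or https. The helper name ocr_online_image() is hypothetical, and no particular image URL is assumed.

```r
# Hypothetical helper: run OCR on an image that lives at a URL.
ocr_online_image <- function(image_url, language = "eng") {
  # Guard against passing a local path by mistake.
  stopifnot(is.character(image_url), grepl("^https?://", image_url))

  # Create the engine for the requested language, as in the local example.
  engine <- tesseract::tesseract(language)

  # ocr() fetches the file itself when it is given a URL.
  tesseract::ocr(image_url, engine = engine)
}
```

The result is used exactly like the local one: text_online <- ocr_online_image(image_url) followed by cat(text_online), where image_url holds the address of the image.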

Available languages for data extraction

By default the package includes only the English training data. Windows and Mac users can install training data for additional languages with tesseract_download(). The tesseract_info() function lists the languages that are currently installed.

Code and result are as below
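
A minimal sketch of the two calls working together, assuming a Windows or Mac installation where tesseract_download() is available. The helper names missing_languages() and install_languages() are hypothetical, and the language code "fra" (French) in the usage note is only an illustration.

```r
# Hypothetical helper: which of the wanted languages are not installed yet?
# `installed` defaults to the list reported by tesseract_info().
missing_languages <- function(wanted,
                              installed = tesseract::tesseract_info()$available) {
  setdiff(wanted, installed)
}

# Hypothetical helper: download training data only for the missing languages.
install_languages <- function(wanted) {
  for (lang in missing_languages(wanted)) {
    tesseract::tesseract_download(lang)
  }
  # Return the updated list of installed languages.
  invisible(tesseract::tesseract_info()$available)
}
```

For example, install_languages(c("eng", "fra")) would download the French data if it is missing, after which tesseract("fra") creates a French engine in the same way as tesseract("eng").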

Alter or improve image quality with Magick

The accuracy of the OCR process depends on the quality of the image we select. With magick we can improve image quality by cropping the image to the area where the text is and by removing noise. The magick R package contains many functions that can be used to improve the quality of a picture.

Some are listed here.

  1. Cropping: image_trim() crops out whitespace in the margins.
  2. Noise removal: image_reducenoise() removes noise automatically.
  3. Rotation: image_rotate() rotates the image so that the text is horizontal.
  4. Brightness and contrast: image_contrast() adjusts the contrast when the image is too flat or dim.
  5. Resizing: image_resize() changes the image size, and with it the text size.

For more information, see the “improve quality” guide, which has important tips for improving the quality of your input image.
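
The functions listed above can be chained into a single cleaning step before OCR. This is a rough sketch rather than a recommended recipe: the helper name ocr_preprocessed() is hypothetical, and the width of 2000 pixels, the fuzz value of 40 and the 300x300 density are assumed starting points that will need tuning for each image.

```r
# Hypothetical helper: clean an image with magick, then run OCR on the result.
ocr_preprocessed <- function(path, language = "eng", rotate_degrees = 0) {
  img <- magick::image_read(path)

  # Enlarge the image so that small text has more pixels per character.
  img <- magick::image_resize(img, "2000x")

  # Rotate only when asked to, for scans where the text is not horizontal.
  if (rotate_degrees != 0) {
    img <- magick::image_rotate(img, rotate_degrees)
  }

  img <- magick::image_reducenoise(img)        # automated noise removal
  img <- magick::image_contrast(img)           # enhance the contrast
  img <- magick::image_trim(img, fuzz = 40)    # crop whitespace in the margins

  # Write the cleaned image to a temporary PNG and OCR that file.
  tmp <- tempfile(fileext = ".png")
  on.exit(unlink(tmp))
  magick::image_write(img, path = tmp, format = "png", density = "300x300")
  tesseract::ocr(tmp, engine = tesseract::tesseract(language))
}
```

Comparing cat(ocr_preprocessed(path)) with the plain ocr() result for the same file shows whether the cleaning helped. It is worth switching the steps on one at a time, because a step that helps one scan can hurt another.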

Getting Data from PDF files

We can also read data from PDF files. The code below extracts text from both locally stored and online PDF files.
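
Tesseract works on images, so a PDF has to be rendered to image files first. The sketch below assumes the pdftools package for that step, which is an assumption on my part and may differ from the code used to produce the output in this section. The helper name ocr_pdf() and the resolution of 600 dpi are likewise only illustrative.

```r
# Hypothetical helper: OCR every page of a PDF given as a local path or a URL.
ocr_pdf <- function(pdf, language = "eng", dpi = 600) {
  # An online PDF is downloaded to a temporary file first.
  if (grepl("^https?://", pdf)) {
    local_pdf <- tempfile(fileext = ".pdf")
    on.exit(unlink(local_pdf), add = TRUE)
    utils::download.file(pdf, local_pdf, mode = "wb", quiet = TRUE)
    pdf <- local_pdf
  }

  # Render each page to a PNG; pdf_convert() writes the files into the
  # working directory and returns their names.
  png_files <- pdftools::pdf_convert(pdf, format = "png", dpi = dpi)

  # ocr() returns one string per page when given several files.
  tesseract::ocr(png_files, engine = tesseract::tesseract(language))
}
```

The same call covers both cases described here: text <- ocr_pdf(pdf_path) for a local file or text <- ocr_pdf(pdf_url) for an online one, followed by cat(text). The rendered PNG files stay in the working directory and can be deleted afterwards.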

Getting data from an online PDF file. For the online case I am using a dummy PDF file.

Code is as below

and result is as below

Getting data from a locally stored PDF file.

Code is as below

and result is as below

#OUTPUT

cat(text)
#> 1.3 R and statistics
#> 
#> Our introduction to the R environment did not mention statistics, yet many people use R as a
#> statistics system. We prefer to think of it of an environment within which many classical and
#> modern statistical techniques have been implemented. A few of these are built into the base R
#> environment, but many are supplied as packages. There are about 25 packages supplied with
#> R (called “standard” and “recommended” packages) and many more are available through the
#> CRAN family of Internet sites (via https://CRAN.R-project.org) and elsewhere. More details
#> on packages are given later (see Chapter 13 [Packages], page 77).
#> 
#> Most classical statistics and much of the latest methodology is available for use with R, but
#> users may need to be prepared to do a little work to find it.
#> 
#> Chapter 1: Introduction and preliminaries 3
#> 
#> There is an important difference in philosophy between S (and hence R) and the other
#> 
#> main statistical systems. In S a statistical analysis is normally done as a series of steps, with
#> intermediate results being stored in objects. Thus whereas SAS and SPSS will give copious
#> output from a regression or discriminant analysis, R will give minimal output and store the
#> results in a fit object for subsequent interrogation by further R functions.
#> 
#> 1.4 R and the window system
#> 
#> The most convenient way to use R 1s at a graphics workstation running a windowing system.
#> This guide 1s aimed at users who have this facility. In particular we will occasionally refer to
#> the use of R on an X window system although the vast bulk of what 1s said applies generally to
#> any implementation of the R environment.
#> 
#> Most users will find it necessary to interact directly with the operating system on their
#> computer from time to time. In this guide, we mainly discuss interaction with the operating
#> system on UNIX machines. If you are running R under Windows or macOS you will need to
#> make some small adjustments.
#> 
#> Setting up a workstation to take full advantage of the customizable features of R is a straightforward
#> if somewhat tedious procedure, and will not be considered further here. Users in difficulty
#> should seek local expert help.
#> 
#> 1.5 Using R interactively
#> 
#> When you use the R program it issues a prompt when it expects input commands. The default
#> prompt is ‘>’, which on UNIX might be the same as the shell prompt, and so it may appear that
#> nothing is happening. However, as we shall see, it is easy to change to a different R prompt if
#> you wish. We will assume that the UNIX shell prompt is ‘$’.
#> 
#> In using R under UNIX the suggested procedure for the first occasion is as follows:
#> 
#> 1. Create a separate sub-directory, say work, to hold data files on which you will use R for
#> this problem. This will be the working directory whenever you use R for this particular
#> problem.

Conclusion:

Comparing the different tools available for image processing, I find R to be one of the most powerful. This vignette showed how to extract text from images and PDF documents, and there is much more to discover in R for image processing.

References:

Some helpful links that I used, including ones for more advanced image processing. I also used an image from the “R for Beginners” book for extracting data.

Alternate or improve quality of Image

Getting data from image

OCR Engine and Package ‘tesseract’

Thank you for your precious time