First, update your system and install Tesseract:

    sudo apt-get update && sudo apt-get upgrade
    apt-get install tesseract-ocr --print-uris

Note that with --print-uris, apt-get only prints the URIs of the packages instead of installing them; omit the flag to actually install.

If you are going to use a language other than English with tesseract, you will have to install the corresponding language package. For example, for Portuguese you will need to do:

    sudo apt-get install tesseract-ocr-por

Otherwise you'll get the error:

    Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/por.traineddata
    Please make sure the TESSDATA_PREFIX environment variable is set to your

If you Google "tesseract PDF" you will probably find this somewhat outdated post. Instead, first convert the PDF to a TIFF image. Run:

    convert -density 125 originalfile.pdf -depth 8 -alpha Off newfile.tiff

If, as in the outdated post, you forget to add -alpha Off, tesseract will fail on the resulting TIFF with the following error:

    Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
    Error in pixReadFromTiffStream: spp not in set

In the particular case that your original PDF is in Portuguese, you will need this command:

    tesseract -l por newfile.tiff output pdf

The generated file will be named output.pdf.

If, for example, your PDF is in French, after you install the corresponding tesseract-ocr-fra package, you will run:

    tesseract -l fra newfile.tiff output pdf

And the desired file will be, again, output.pdf.
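The two steps above (convert, then tesseract) can be wrapped in a small shell function. This is only a minimal sketch: the names ocr_pdf, run and DRY_RUN are hypothetical helpers invented for this example, not part of ImageMagick or Tesseract, and the intermediate TIFF name is simply derived from the input file name rather than being newfile.tiff.

```shell
#!/bin/sh
# Sketch only: ocr_pdf, run and DRY_RUN are hypothetical names for this example.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: ocr_pdf INPUT.pdf LANG [OUTPUT_BASE]
# LANG is a tesseract language code such as por or fra.
ocr_pdf() {
    input=$1
    lang=$2
    base=${3:-output}
    # Assumption: name the intermediate TIFF after the input file.
    tiff="${input%.pdf}.tiff"

    # Step 1: rasterize the PDF; -alpha Off avoids the "spp not in set" error.
    run convert -density 125 "$input" -depth 8 -alpha Off "$tiff" || return 1
    # Step 2: OCR the TIFF; the trailing "pdf" asks tesseract for BASE.pdf.
    run tesseract -l "$lang" "$tiff" "$base" pdf
}
```

After sourcing the file, setting DRY_RUN=1 and calling ocr_pdf originalfile.pdf fra should only print the two commands, which is a cheap way to check them before running the real conversion.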