Tesseract OCR in Zend Framework - Tundra blog

Feel the cooling sensation of knowing things...


It is a programming blog btw


Wednesday, 1 November 2017

Tesseract OCR in Zend Framework

Hi,
this is going to be a short guide on getting started with Tesseract OCR in PHP. Tesseract is an open-source OCR engine that is quite competitive. I tried it on a project and was surprised by how well it performed, so I wanted to let more people know about it.



Anyway, here you can see the repository for Tesseract. I will be going through it with a Zend Framework 2 project that I've been building for some time.

My current setup for ZF2 projects is PHPStorm + Vagrant (currently the 2017 version of PHPStorm, with the ScotchBox configuration for Vagrant).
Alright, let's cut it short and start doing something.

0- Start PHPStorm and run Tools > Vagrant > Up.

1- Installing Tesseract is fairly simple. Since it runs on the server side, we need to install Tesseract-OCR inside our Vagrant box (which is basically a Linux VM). Afterwards you have two choices: either add the Tesseract PHP library to your project, or simply execute shell commands with the proper Tesseract arguments on your server.

Installation comes down to executing a few commands in our Vagrant machine. To connect to it, go to Tools > Start SSH Session and choose your virtual machine.
Once the terminal for your Vagrant machine is open, run the commands below.

For the Tesseract itself:

sudo apt-get install tesseract-ocr

It will probably ask you to confirm the installation; say yes. We are not finished yet, though. By default Tesseract ships with English only, so for any other language the corresponding language pack has to be installed as well.

Let's install Polish language support.

sudo apt-get install tesseract-ocr-pol

The suffix (pol here) is a language code, not a country code; Tesseract mostly uses three-letter ISO 639-2 codes. For a different language, you may look for the language code here.
The installation will again ask for confirmation. Accept and it will continue. After it finishes, you are ready to shoot.
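
If you want to double-check from PHP that the language pack really landed, something like the sketch below could work. It assumes PHP 7+ and that the tesseract binary is on the PATH; the function names are made up for this post, and the 2>&1 is only a precaution because some Tesseract versions print the language list to stderr.

```php
<?php
/**
 * Parse the text printed by `tesseract --list-langs` into an array of codes.
 * The header line ("List of available languages ...:") is skipped.
 */
function parseTesseractLangs(string $output): array
{
    $langs = [];
    foreach (preg_split('/\R/', trim($output)) as $line) {
        $line = trim($line);
        // Skip blank lines and the header line, which ends with a colon.
        if ($line === '' || substr($line, -1) === ':') {
            continue;
        }
        $langs[] = $line;
    }
    return $langs;
}

/**
 * Hypothetical helper: true if the given language pack is installed.
 * stderr is merged into stdout as a precaution (see the note above).
 */
function hasTesseractLang(string $code): bool
{
    $output = (string) shell_exec('tesseract --list-langs 2>&1');
    return in_array($code, parseTesseractLangs($output), true);
}
```

Calling hasTesseractLang('pol') before scanning gives you a chance to show a clear error instead of a confusing OCR failure.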

2.a- Usage on the command line:
Command-line usage of Tesseract is quite simple. Here you can see all possible commands.
A sample piece from a picture of an invoice:
[Image: sample piece of the invoice picture]

A sample command that I've used is:

tesseract data/Faktura.jpg stdout -l pol -c 'output=hocr' hocr

This command reads the 'Faktura.jpg' file and produces the hOCR segments for it. Rendered in the browser, the output looks like this:

[Image: hOCR output rendered in the browser]
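
Step 1 mentioned running shell commands from PHP as an alternative to the library. A minimal sketch of that route is below. It assumes PHP 7.1+ and tesseract on the PATH; buildHocrCommand and runHocr are names I made up. It relies only on the trailing hocr config name to select hOCR output and escapes the arguments, since file paths often come from user uploads.

```php
<?php
/**
 * Build a Tesseract command line that prints hOCR to stdout.
 * escapeshellarg() guards against spaces or shell metacharacters in the path.
 */
function buildHocrCommand(string $imagePath, string $lang = 'eng'): string
{
    return sprintf(
        'tesseract %s stdout -l %s hocr',
        escapeshellarg($imagePath),
        escapeshellarg($lang)
    );
}

/**
 * Run Tesseract and return the hOCR markup, or null if nothing came back.
 */
function runHocr(string $imagePath, string $lang = 'eng'): ?string
{
    $output = shell_exec(buildHocrCommand($imagePath, $lang));
    return is_string($output) && trim($output) !== '' ? trim($output) : null;
}
```

With this in place, runHocr('data/Faktura.jpg', 'pol') would run roughly the same command as above.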

hOCR is OCR output expressed as HTML, with tags and location (bounding box) attributes. It is quite useful when developers write templates. Here is a snippet of the output for Faktura.jpg:
Faktura VAT  ORYGINAŁ  2017-07-20 Poznań 2017-07-20 

Pieczeć 'irmy data I miejsce wyslawiema dokumentu data sprzedaży 

As you can see, it is plain HTML, which is quite useful: it can easily be loaded into DOMElement objects to read attributes and more.
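
To make the DOM idea concrete, here is a rough sketch that pulls the words and their bounding boxes out of an hOCR string with DOMDocument and DOMXPath. It assumes PHP 7+ with the DOM extension and the usual hOCR conventions (words marked with the ocrx_word class, coordinates in the title attribute as "bbox x0 y0 x1 y1"). The function name and the sample markup are made up for illustration; the sample is not the real output for Faktura.jpg.

```php
<?php
/**
 * Extract words and their bounding boxes from an hOCR string.
 * Returns a list of ['text' => string, 'bbox' => [x0, y0, x1, y1]].
 */
function extractHocrWords(string $hocr): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    // The prefix is a common workaround that hints libxml to read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $hocr);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " ocrx_word ")]';
    $words = [];
    foreach ($xpath->query($query) as $node) {
        // The title attribute looks like "bbox 36 92 96 116; x_wconf 95".
        if (!preg_match('/bbox (\d+) (\d+) (\d+) (\d+)/', $node->getAttribute('title'), $m)) {
            continue;
        }
        $words[] = [
            'text' => trim($node->textContent),
            'bbox' => array_map('intval', array_slice($m, 1)),
        ];
    }
    return $words;
}

// Hypothetical two-word sample, not the real output for Faktura.jpg.
$sample = '<div class="ocr_page"><span class="ocr_line" title="bbox 10 10 200 40">'
    . '<span class="ocrx_word" title="bbox 10 10 90 40">Faktura</span> '
    . '<span class="ocrx_word" title="bbox 100 10 150 40">VAT</span></span></div>';

foreach (extractHocrWords($sample) as $word) {
    echo $word['text'], ' @ ', implode(',', $word['bbox']), PHP_EOL;
}
```

Once you have the boxes, a template can be as simple as "take the words whose bbox falls inside this rectangle of the invoice".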

2.b- Usage with PHP Library
We have to add the Tesseract PHP library to our project. You can go to the Tesseract repo and simply add it using Composer. Here is the line you must add to your composer.json file:

"thiagoalessio/tesseract_ocr": "1.0.0-RC"

So far the library supports a sufficient set of parameters, but not all of them. In the example below, Tesseract scans the image with Polish language support.

// Scan the image with the Polish language data; run() returns the recognized text.
$tess = new \TesseractOCR($imagePath);
$text = $tess->lang('pol')->run();

But here is a small trick for additional parameters: create a new class that extends the Tesseract library class and override the 'run' method so that it produces 'hocr' output, which is not implemented in the library. Like this:

class TesseractCustom extends \TesseractOCR
{
    /**
     * The original PHP library cannot produce hOCR output,
     * so we override run() and append the hocr config name.
     *
     * @return string
     */
    public function run()
    {
        $req = $this->buildCommand() . ' hocr';
        // Backticks execute the command through the shell, like shell_exec().
        return trim(`{$req}`);
    }
}

This quick override makes it possible to add whatever parameters we need. I guess that is it for now. I hope it will be useful for someone :) Good luck developing!
