Robust Invoice Data Extraction in English, Russian and Romanian

Billheap.com
Dec 4, 2020

Templatic documents, such as receipts, bills, and insurance quotes, are extremely common and critical in a wide range of business workflows. However, current approaches to handling them still rely on a lot of manual work or on OCR-based heuristics for extraction. Although OCR has been quite effective at digitizing machine-printed text, it has considerable limitations when dealing with form-like data.

Using AI to handle form-like data is a difficult task because it involves both computer vision and NLP. Furthermore, the data entered into forms need not be natural language, so the NLP algorithms must be prepared to handle unknown words. In addition, most existing work has targeted English, so more models must be trained to work with other languages such as Romanian and Russian. In this article we look at the challenges involved in handling dynamic data and at how different AI methods can be used to tackle the problem, along with corresponding code references.
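To make the point about unknown words concrete, here is a minimal sketch of word-shape features, one common way to describe tokens a model has never seen (invoice numbers, tax IDs, amounts). It is an illustration under our own assumptions, not a description of any particular production system; the function name word_shape is hypothetical. Because it relies on Unicode character categories rather than a fixed alphabet, the same code handles English, Romanian and Russian text.

```python
import unicodedata


def word_shape(token: str) -> str:
    """Map a token to a coarse shape: X = uppercase letter, x = other letter,
    d = decimal digit, any other character is kept as-is.
    Runs of the same symbol are collapsed into one."""
    shape = []
    # NFC normalization keeps accented letters such as "ă" as single code points.
    for ch in unicodedata.normalize("NFC", token):
        category = unicodedata.category(ch)
        if category == "Lu":
            symbol = "X"
        elif category.startswith("L"):
            symbol = "x"
        elif category == "Nd":
            symbol = "d"
        else:
            symbol = ch
        # Collapse consecutive identical symbols.
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


if __name__ == "__main__":
    for token in ["INV-0042", "120.50", "Factură", "Счёт", "RO12345678"]:
        print(token, "->", word_shape(token))
```

An invoice number such as "INV-0042" becomes "X-d" and an amount such as "120.50" becomes "d.d", so a downstream model can learn from the pattern even when the exact string is new to it.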

Why is invoice data extraction hard?

The challenge in this data extraction problem arises because it sits at the intersection of natural language processing (NLP) and computer vision (CV). Unlike classic NLP tasks, such documents do not contain “natural language” as found in ordinary sentences and paragraphs, but instead resemble forms. Data is often presented in tables; in addition, many documents have multiple pages, frequently with a varying number of sections, and use a variety of layout and formatting cues to organize the information. An understanding of the two-dimensional layout of text on the page is key to understanding such documents. On the other hand, treating this purely as an image segmentation problem makes it hard to exploit the semantics of the text.
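The role of two-dimensional layout can be illustrated with a small sketch. It assumes an OCR step has already produced tokens with page coordinates; the Token class, the group_into_rows and value_right_of helpers, and the sample coordinates are all hypothetical and chosen only for illustration. The idea is that a field value is often found by position (to the right of its label on the same visual row) rather than by reading order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    text: str
    x: float  # left edge of the token's bounding box
    y: float  # vertical center of the token's bounding box


def group_into_rows(tokens: List[Token], y_tolerance: float = 5.0) -> List[List[Token]]:
    """Group tokens into visual rows: tokens whose vertical centers lie within
    y_tolerance of the first token in a row are placed on that row."""
    rows: List[List[Token]] = []
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        if rows and abs(rows[-1][0].y - tok.y) <= y_tolerance:
            rows[-1].append(tok)
        else:
            rows.append([tok])
    # Within each row, order tokens left to right.
    return [sorted(row, key=lambda t: t.x) for row in rows]


def value_right_of(rows: List[List[Token]], label: str) -> Optional[str]:
    """Return the text to the right of `label` on the same row, if any."""
    for row in rows:
        for i, tok in enumerate(row):
            if tok.text.lower() == label.lower() and i + 1 < len(row):
                return " ".join(t.text for t in row[i + 1:])
    return None


if __name__ == "__main__":
    # OCR output often arrives in scrambled order; layout restores the structure.
    tokens = [
        Token("120.50", 420, 499), Token("Invoice", 40, 100),
        Token("Total:", 300, 501), Token("INV-0042", 150, 99),
        Token("EUR", 480, 500), Token("No.", 110, 101),
    ]
    rows = group_into_rows(tokens)
    print(value_right_of(rows, "Total:"))  # 120.50 EUR
    print(value_right_of(rows, "No."))     # INV-0042
```

A purely positional rule like this ignores what the words mean, which is why it is usually combined with text-based features in practice.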

Machine Learning methods

What about solutions for the Romanian and Russian languages?

BillHeap is a data extraction solution that saves you hours of manual work. It intelligently extracts information from invoices, using artificial intelligence (AI) to recognize them, and is compatible with most ERP systems.
