Classification Data Extraction Machine Learning

Is AI in Data Capture real? or: Are templates really a bad thing?

Published on 2022-07-01

Many Capture and RPA vendors claim that they use AI and Machine Learning in their software. Back in the day, everything was “templates” and regular expressions, words that still carry negative associations, so vendors avoid the word “template” in their marketing language. But are templates really so bad? And how can even a “templated” approach be AI? On the other hand, is AI really useful for data extraction, or are the vendors exaggerating when they use the terms AI and ML?

Let’s talk about these questions.

What is the difference between templates and machine learning when extracting data?

In classic capture software from the early 2000s, software that often still exists today in much-matured versions, templates were the status quo. A template is specific to a document type. For every document type, you need to create a template, and for every template, you need to specify the zone or the regular expression for each individual field you want to extract. For invoice processing, that used to mean creating one template per vendor. This could take thousands of hours to set up, so it was usually done by dealing with the high-frequency vendors first. Hand-written rules then determined which template to apply to an incoming invoice.
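
As a rough illustration of the classic approach, here is a minimal Python sketch. The vendor names, field names, and patterns are all made up for the example; real products typically work on page zones and OCR results rather than plain strings.

```python
import re

# Hypothetical templates: one per vendor, one regular expression per field.
TEMPLATES = {
    "Amazon1": {
        "invoice_number": re.compile(r"Invoice #:\s*(\S+)"),
        "total": re.compile(r"Grand Total:\s*\$?([\d,]+\.\d{2})"),
    },
    "Acme1": {
        "invoice_number": re.compile(r"Inv\. No\.\s*(\S+)"),
        "total": re.compile(r"Amount due\s*([\d,]+\.\d{2})\s*USD"),
    },
}

# Hand-written selection rules: (keyword that must appear, template to use).
RULES = [
    ("AmazonUS", "Amazon1"),
    ("Acme Corp", "Acme1"),
]


def select_template(text):
    """Return the name of the first template whose rule matches, else None."""
    for keyword, template_name in RULES:
        if keyword in text:
            return template_name
    return None


def extract(text):
    """Pick a template by rule, then apply its per-field patterns."""
    template_name = select_template(text)
    if template_name is None:
        return None  # unknown vendor: someone has to build a new template
    fields = {}
    for field, pattern in TEMPLATES[template_name].items():
        match = pattern.search(text)
        fields[field] = match.group(1) if match else None
    return fields


invoice = "AmazonUS\nInvoice #: INV-1001\nGrand Total: $1,234.50"
print(extract(invoice))
```

Every new vendor means another entry in both TEMPLATES and RULES, which is where the setup hours go.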

Obviously, the classic template approach is unappealing.

Very modern AI-based extraction approaches don’t require templates. Some vendors pre-train vendor-agnostic neural networks for specific document types like invoices. This is only possible if they have thousands of labeled examples, and the resulting model only works for invoices and only for the language it was trained on. This approach isn’t feasible if you want to build a custom data extraction model, though: labeling 10,000 sample documents takes just as much time as creating templates.

Using AI to auto-create templates

A pretty common approach these days is to use Machine Learning to train a classification model. For each class, the software “learns” to extract the data from a known location. If you think about it, this is still a templated approach. The actual data extraction is usually a fixed-zone extraction, specific to each document type, which is exactly what a template does. The difference is that the correct template is identified with machine learning (classification). Instead of writing rules (like “IF VENDOR NAME = AmazonUS THEN USE TEMPLATE Amazon1”), the machine learns to classify the document and assigns it to a class that is associated with a template. Once the template is known, the field data is extracted from known locations, just like in the classic template approach.

This is a very common technique in modern capture software. It allows the vendors to market it as AI because classification is used to identify the template. It also saves a lot of time because you do not need to pre-train such models and pay huge service costs to do so. Instead, the approach allows auto-learning from the corrections that human reviewers make anyway. The system learns as they do their normal work.
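
The auto-templating idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the classifier is a toy word-frequency matcher standing in for whatever model a real product uses, and the template names, patterns, and sample texts are invented.

```python
import math
import re
from collections import Counter

# Hypothetical per-class templates: each field is still a fixed pattern.
TEMPLATES = {
    "Amazon1": {"total": re.compile(r"Grand Total:\s*\$?([\d,]+\.\d{2})")},
    "Acme1": {"total": re.compile(r"Amount due\s*([\d,]+\.\d{2})")},
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TemplateClassifier:
    """Toy classifier: one word-frequency profile per template."""

    def __init__(self):
        self.profiles = {}  # template name -> Counter of words seen

    def learn(self, text, template_name):
        profile = self.profiles.setdefault(template_name, Counter())
        profile.update(tokenize(text))

    def predict(self, text):
        words = Counter(tokenize(text))
        best_name, best_score = None, 0.0
        for name, profile in self.profiles.items():
            score = cosine(words, profile)
            if score > best_score:
                best_name, best_score = name, score
        return best_name  # None if no profile shares a single word


def extract(classifier, text):
    """Classify first, then read the fields the classic template way."""
    name = classifier.predict(text)
    if name is None or name not in TEMPLATES:
        return name, {}
    fields = {}
    for field, pattern in TEMPLATES[name].items():
        match = pattern.search(text)
        fields[field] = match.group(1) if match else None
    return name, fields


clf = TemplateClassifier()
clf.learn("AmazonUS order invoice Grand Total: $10.00 shipped by Amazon", "Amazon1")
clf.learn("Acme Corp statement Amount due 99.00 USD net thirty", "Acme1")
print(extract(clf, "Invoice from AmazonUS, order shipped. Grand Total: $1,234.50"))

# A reviewer's correction simply becomes one more training example.
unknown = "Globex Ltd bill, payable sum 5.00"
print(clf.predict(unknown))   # no known template matches yet
clf.learn(unknown, "Globex1")
print(clf.predict(unknown))   # now assigned to the new class
```

Note that only the template selection is learned. The field extraction inside extract is the same pattern lookup as in a hand-built template.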

It is legitimate to say that auto-templating uses AI. Not all of the steps used to extract the field data are AI, but the approach uses AI. So vendors don’t exaggerate when they claim that. On the other hand, using the word “template” isn’t a bad thing either, as long as the template is identified automatically!

Templates are not always possible

In some cases, you cannot use templates. If your document types are entirely unstructured and have no fixed layout (like an email, a contract, or a customer letter), there is nothing to build a template on: every document looks different. In this case, you have two choices:

Rules
AI

If you want to extract, for example, the Effective Date from a contract, you can try rules. Finding all the dates with regular expressions is no problem. But to identify the right date, you need to look at the context and keywords around those dates. Every contract will have a different context for the same data value, and the position of the keyword and date on the page tells you nothing. So you end up writing many rules.
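
To make the rule-writing effort concrete, here is a minimal Python sketch. The date format, the context phrases, and the sample sentences are assumptions chosen for the example; real contracts would need many more of each.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Finding dates is the easy part (only one date format is handled here).
DATE = re.compile(r"\b(?:%s) \d{1,2}, \d{4}\b" % MONTHS)

# Hand-written context rules: phrases that must directly precede the date.
CONTEXT_RULES = [
    re.compile(r"effective (?:as of|date(?: of)?|on)\s*:?\s*$", re.IGNORECASE),
    re.compile(r"shall commence on\s*$", re.IGNORECASE),
    re.compile(r"entered into (?:as of|on)\s*$", re.IGNORECASE),
]


def find_effective_date(text):
    """Return the first date whose preceding text matches a context rule."""
    for match in DATE.finditer(text):
        before = text[max(0, match.start() - 40):match.start()]
        if any(rule.search(before) for rule in CONTEXT_RULES):
            return match.group(0)
    return None


contract = ("This Agreement, signed on March 3, 2021, is effective as of "
            "April 1, 2021 and expires on March 31, 2024.")
print(find_effective_date(contract))

# A wording nobody wrote a rule for is silently missed.
print(find_effective_date("Start date: July 1, 2023"))
```

Each new phrasing found in production means another entry in CONTEXT_RULES.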

AI is designed for writing these rules for you!

If you train an AI model (Conditional Random Fields and CNNs are popular algorithms for these use cases), all you need to do is select the Effective Date in many example contracts (“label the document”) and let the AI figure out what the unique context around that date is and why the other dates in the document are not the Effective Date. So in a way, the AI writes the rules for you, though of course you don’t get to see them.
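
The idea of letting labels produce the “rules” can be shown with a deliberately tiny Python sketch. It is not a CRF or a CNN: it is a toy word-count scorer, used here only as a stand-in, and the contracts and labels are invented. Each labeled Effective Date votes its preceding words up, and every other date in the same contract votes its preceding words down.

```python
import re
from collections import Counter

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE = re.compile(r"\b(?:%s) \d{1,2}, \d{4}\b" % MONTHS)


def context_words(text, start, n=4):
    """The last n alphabetic words before position start, lowercased."""
    return set(re.findall(r"[a-z]+", text[:start].lower())[-n:])


def train(labeled_contracts):
    """labeled_contracts: list of (contract text, labeled Effective Date)."""
    weights = Counter()
    for text, label in labeled_contracts:
        for match in DATE.finditer(text):
            delta = 1 if match.group(0) == label else -1
            for word in context_words(text, match.start()):
                weights[word] += delta
    return weights


def predict(weights, text):
    """Return the date whose context words have the highest total weight."""
    best, best_score = None, None
    for match in DATE.finditer(text):
        words = context_words(text, match.start())
        score = sum(weights[word] for word in words)
        if best_score is None or score > best_score:
            best, best_score = match.group(0), score
    return best


LABELED = [
    ("This Agreement, signed on March 3, 2021, is effective as of "
     "April 1, 2021 and expires on March 31, 2024.", "April 1, 2021"),
    ("Signed on May 2, 2020. The Agreement becomes effective on "
     "June 1, 2020 and expires on May 31, 2023.", "June 1, 2020"),
]
weights = train(LABELED)

new_contract = ("Executed on January 5, 2022. This Agreement shall be "
                "effective from February 1, 2022 and terminates on "
                "January 31, 2025.")
print(predict(weights, new_contract))
```

Nobody wrote a rule for “effective from”, yet the labeled examples gave “effective” a positive weight and “expires” a negative one. In this toy the weights can still be inspected; in the models named above the learned context is far less readable.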

Is pure AI really the best approach?

The auto-template approach described above is very efficient in terms of cost-benefit. But it sure is not as sexy as a “real AI” approach.

As usual, the best approach depends on your documents. Your main goal is to spend as little time and money as possible on creating models that yield good extraction accuracy. If your documents are structured forms or semi-structured documents like invoices, purchase orders, etc., auto-templating is a good approach that yields high accuracy once the template is identified. And it can automatically improve (learn) over time.

For unstructured documents, you often have no choice but AI. If your capture software supports building custom AI models for those document types, you are in good hands.
