Document Classification Explained

Published on 2023-03-01

Understanding what a document is about

Quick and accurate distribution of incoming mail or email in an organization is essential. An incoming Request for Proposal (RFP) needs to get to the Sales department, a job application should be processed by the correct hiring manager quickly, and insurance claims need to be assigned to the correct department and agent to be processed in a timely fashion.

Intelligent Capture Software often has built-in document classification tools that can make this task simple to set up.

The old days

In the old days, documents came in via mail, fax, or email, and often they reached a central recipient address: the mail room, the info@… address, or the general fax number. People monitored these channels and manually distributed or forwarded the documents to the right people. This required a lot of staff, and the work was error-prone because they had to know every potential recipient and department and infer the right one from reading or skimming the document.

“Inferring the type of document from skimming it”? That is exactly what Classification, a mature machine learning technology, does.

Before automatic classification became established, there were earlier ways of automating the routing of incoming documents. This could involve reading barcodes or performing OCR and applying a massive rules engine. This rules engine could have rules like “If the word ‘invoice’ is on the document then send it to the accounting team”. But the rule needs to account for OCR errors, synonyms of ‘invoice’ like ‘bill’, abbreviations like “Inv.”, and even other languages. It also needs to exclude the term “invoice” when it appears in complaints, for example, which are handled by the customer support organization, not by the accounting team. If a customer complains about being billed too much, the email or letter should not go to the accounting department just because the word “invoice” occurs in the text.
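To make the problem concrete, here is a minimal sketch of such a keyword rule in Python. The pattern, the function name route_by_rule, and the three sample texts are made up for illustration and are not taken from any real product.

```python
import re

# Hypothetical keyword rule: any mention of an invoice goes to accounting.
# It covers the word itself, the synonym "bill", and the abbreviation "Inv."
INVOICE_RULE = re.compile(r"\b(?:invoice|bill)\b|\binv\.", re.IGNORECASE)


def route_by_rule(text: str) -> str:
    """Return a destination using a single naive keyword rule."""
    if INVOICE_RULE.search(text):
        return "accounting"
    return "unrouted"


# Made-up sample documents.
real_invoice = "Inv. 2023-114: amount due 1,200.00 EUR"
complaint = "I was charged twice. Please correct my last invoice."
ocr_damaged = "lnvoice 2023-115, payable within 30 days"  # OCR read 'I' as 'l'

print(route_by_rule(real_invoice))  # accounting
print(route_by_rule(complaint))     # accounting (misrouted, belongs to support)
print(route_by_rule(ocr_damaged))   # unrouted (the rule missed the OCR error)
```

Every fix for cases like the last two adds another rule or exception, which is how rule sets grow out of control.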

Now imagine having 200 different types of documents and destinations. Rules can become complex and unmanageable very quickly.

Modern technologies for document classification

At a high level, modern classification technology for documents comes in two flavors: based on the image only (e.g. a PDF or a scanned document) or based on content (the text of the document or email).

Classifying documents based on the image alone is often very fast because no OCR is needed. However, it may not always be applicable. E.g. if you need to distinguish between 2 different similar-looking forms that only differ in minor detail, you need more than just a “glance at the image”. Or if you need to classify emails, social media posts, or SMS, these “documents” have no visual component and cannot be classified by “looking at them”. Instead, you need to have the text and classify them based on the content.

The classification algorithms for image-based approaches are often kNN (k-nearest-neighbor) or neural networks, as both are well suited for handling images. kNN has been around for a long time and is a mature technique. Neural networks are also very well suited to classifying images, but often they are not the best tool for business documents because they need too many training samples and have other limitations that make them unsuitable in practice.
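As a rough illustration of the kNN idea, the sketch below classifies a page by majority vote among its nearest training samples. The four-value "thumbnails" (imagined as the average darkness of each page quadrant), the labels, and the function name knn_classify are all invented for this example; a real system would derive much richer features from the page image.

```python
import math
from collections import Counter


def knn_classify(sample, training_set, k=3):
    """Majority vote among the k training vectors closest to `sample`."""
    by_distance = sorted(training_set, key=lambda item: math.dist(sample, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors with made-up values; three labeled samples per class.
training_set = [
    ([0.9, 0.8, 0.1, 0.1], "letter"),
    ([0.8, 0.9, 0.2, 0.1], "letter"),
    ([0.9, 0.9, 0.1, 0.2], "letter"),
    ([0.3, 0.3, 0.7, 0.8], "form"),
    ([0.2, 0.4, 0.8, 0.7], "form"),
    ([0.3, 0.2, 0.9, 0.8], "form"),
]

print(knn_classify([0.85, 0.8, 0.15, 0.1], training_set))  # letter
print(knn_classify([0.25, 0.3, 0.8, 0.9], training_set))   # form
```

There is no separate training step here: the labeled samples themselves are the model, which is one reason kNN can work with small sample sets.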

Classifying documents based on text can be done with Naive Bayes classifiers or Support Vector Machines in the simplest case, with modern neural network approaches based on Large Language Models (LLMs), or with ready-made classifiers like fastText from Facebook AI Research. For text classification, the “feature selection” part of the approach is often more important than selecting the actual classifier algorithm. In other words, which features (words, phrases, tokens) the classifier learns from matters more than how it learns them (for a machine at least).
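The following is a minimal sketch of a Naive Bayes text classifier written from scratch, with a deliberately simple feature selection step (lowercase word tokens minus a small stop-word list). The stop-word list, the six training texts, and the class names are assumptions made up for illustration; production systems use far more data and more careful feature selection.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "for", "and", "my", "i", "please", "in"}


def extract_features(text):
    """Feature selection: lowercase word tokens with stop words removed."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def train(samples):
    """Count documents per class and token occurrences per class."""
    doc_counts = Counter()
    token_counts = defaultdict(Counter)
    for text, label in samples:
        doc_counts[label] += 1
        token_counts[label].update(extract_features(text))
    return doc_counts, token_counts


def classify(text, doc_counts, token_counts):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    vocabulary = {t for counts in token_counts.values() for t in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_tokens = sum(token_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for token in extract_features(text):
            count = token_counts[label][token]
            score += math.log((count + 1) / (total_tokens + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data: two labeled samples per class.
samples = [
    ("Please find attached the invoice for your order", "accounting"),
    ("Payment due for invoice number 4411", "accounting"),
    ("I want to complain about the wrong invoice amount", "support"),
    ("My complaint is about being charged twice", "support"),
    ("I am applying for the open engineer position", "hiring"),
    ("Attached is my resume and job application", "hiring"),
]
doc_counts, token_counts = train(samples)

complaint = "I have to complain, the invoice amount is wrong and I was charged twice"
print(classify(complaint, doc_counts, token_counts))  # support
```

Unlike a single keyword rule, the classifier weighs all tokens together, so the word "invoice" alone does not send this complaint to accounting.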

4 things to watch out for when selecting a vendor for document classification

When selecting an Intelligent Document Automation vendor for your custom classification problem, there are a few important things to consider.

Do they have a no-code platform or do you need a programmer or even a data scientist to train and run the classification models?

Some IDP products are so simple to use that all you need to do is drag & drop your sample documents for each class into the user interface and push a “build the AI models” button. Other vendors require you to communicate with web services and parse JSON, or even curate the data yourself. This can require coders or even a data scientist.

How many labeled samples do you need to train the model?

Neural networks for image processing are notoriously data-hungry. Some neural networks for computer vision tasks like face recognition or self-driving cars are trained on billions of images. And those aren’t just billions of random images, but labeled samples. This means a person has manually flagged every sample with “what it is supposed to be classified as”.

If the capture software you are evaluating employs a modern-sounding neural network for classification, you should ask the vendor how many samples it needs to be trained with. Several hundred to several thousand samples is pretty common for these approaches.

A kNN or Support Vector Machine can be trained with anywhere from fewer than 10 to about 100 samples. This can make a big difference in the cost and duration of setting up the system.

Does the model automatically get improved as the system operates?

All ML-enabled capture products use machine learning tech that allows you to train the system before you go to production. But as your staff fixes errors (assigning a document to the right class when the machine made an incorrect prediction), does the machine improve the model automatically? Some ML algorithms support incrementally improving the model and some don’t, so this is an important question to ask the vendor. If the model does not improve automatically, making improvements means bringing in a consultant to add more samples and releasing a new, improved version of the model to production.
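One way to picture incremental improvement is a nearest-neighbor classifier that simply stores every operator correction as a new labeled sample. The sketch below is a toy under that assumption; the class name IncrementalNearestNeighbor, the two-value feature vectors, and the labels are invented, and real products may implement learning from corrections very differently.

```python
import math


class IncrementalNearestNeighbor:
    """1-nearest-neighbor classifier that learns from operator corrections."""

    def __init__(self):
        self.samples = []  # list of (feature_vector, label) pairs

    def add_sample(self, features, label):
        self.samples.append((list(features), label))

    def predict(self, features):
        nearest = min(self.samples, key=lambda s: math.dist(features, s[0]))
        return nearest[1]

    def correct(self, features, true_label):
        """Store a human correction; no separate retraining run is needed."""
        self.add_sample(features, true_label)


model = IncrementalNearestNeighbor()
model.add_sample([0.9, 0.1], "invoice")  # made-up initial training samples
model.add_sample([0.1, 0.9], "claim")

new_doc = [0.6, 0.5]
before = model.predict(new_doc)   # "invoice": the closer of the two samples
model.correct(new_doc, "claim")   # an operator fixes the prediction
after = model.predict(new_doc)    # "claim": the correction is now the nearest sample

print(before, after)
```

After the correction, similar documents such as [0.62, 0.52] are also classified as "claim", without anyone rebuilding or redeploying a model.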

Performance

How long does it take to train the ML model and how long does the classification of a document take? Modern Intelligent Document Automation products should be able to build the model within minutes even for large numbers of classes and samples. And that should not require a massive array of GPUs but rather work on a desktop computer.

The runtime classification of a single-page document should take well under a second. If a product doesn’t meet these timings, you should question the technology and check whether it is suitable for your use cases.
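If a vendor gives you programmatic access, a simple timing harness like the sketch below can check the per-document latency on your own samples. The function placeholder_classifier is a stand-in invented for this example; in practice you would replace it with a call to the product under evaluation.

```python
import time


def average_latency_seconds(classify, documents):
    """Average wall-clock time per document for the given classify function."""
    start = time.perf_counter()
    for doc in documents:
        classify(doc)
    return (time.perf_counter() - start) / len(documents)


def placeholder_classifier(text):
    """Hypothetical stand-in for the real classification call."""
    return "invoice" if "invoice" in text.lower() else "other"


# Made-up test set: 1,000 short documents.
documents = ["Invoice 4411, amount due", "Job application attached"] * 500
latency = average_latency_seconds(placeholder_classifier, documents)
print(f"{latency * 1000:.4f} ms per document")
```

Measuring with your own documents matters because page count, image resolution, and OCR quality can all affect the result.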

We hope these points help you ask vendors detailed questions and decide whether their product is the right fit.
