A robust training process for NLP models


This is an excerpt from the guest article “Architecture for serial process automation with Natural Language Processing (NLP)” in topic dossier No. 21/2021 of Versicherungsforen Leipzig. The full article is available for download as a PDF (in German).

Properly and consistently applied, Natural Language Processing is a technology that can be used to automate knowledge-based processes and generate long-term strategic benefits.

The difference between a quick, successful implementation and a never-ending proof of concept lies not in model training, but in the selection, conversion, cleansing, creation, and annotation of training data.

The biggest levers for a shorter time to market are a robust training process, the experience of the data scientists involved, and suitable tooling.

A best practice training process usually consists of five clearly defined steps.

1. Preparation and synthesis of training data

Existing documents are converted into a suitable format - the choice of OCR engine with layout recognition plays a major role here. The converted documents are cleaned and deduplicated, and a suitable selection is made for balanced training.

Especially when only a few sample documents are available, or when the structures to be recognized are semi-regular, synthesizing training data with specialized generative grammars is a good choice. This approach can generate thousands of training examples with reasonable variance within a short time - and can make the following step unnecessary.
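To illustrate the idea, here is a minimal sketch of training-data synthesis with a generative grammar. The grammar, field names, and phrasings below are illustrative assumptions, not the specific tooling described in the article:

```python
import random

# Toy generative grammar: non-terminals expand to lists of productions.
# All rules and sample values here are invented for illustration.
GRAMMAR = {
    "SENTENCE": [["INTRO", "POLICY", "REQUEST"]],
    "INTRO": [["Dear Sir or Madam,"], ["Hello,"], ["To whom it may concern,"]],
    "POLICY": [["my policy number is", "NUMBER", "."]],
    "NUMBER": [["KV-12345"], ["HH-98765"], ["LV-55501"]],
    "REQUEST": [["Please cancel my contract."],
                ["I would like to report a claim."],
                ["Please update my address."]],
}

def generate(symbol="SENTENCE"):
    """Recursively expand a grammar symbol into one synthetic sentence."""
    if symbol not in GRAMMAR:
        return symbol  # terminal token, emit as-is
    production = random.choice(GRAMMAR[symbol])
    return " ".join(generate(s) for s in production)

# Thousands of varied examples can be produced in a fraction of a second.
samples = [generate() for _ in range(1000)]
```

In practice, the grammar would cover the document structures and value ranges observed in the real corpus, so that the synthesized variance matches what the model will see in production.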

2. Manual annotation of some examples

Using transfer learning, an initial model can now be trained with as few as 50-150 annotated examples. Annotation is best done in cooperation between the data team and the business department, ensuring that an optimal training basis is created within a few days.
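Annotations for such training are typically stored as text plus labeled spans. The record shape below is a hypothetical example of such a format (one JSON object per line of a JSONL file), not a format prescribed by the article:

```python
import json

# Hypothetical annotation record: raw text plus character-offset entity spans.
record = {
    "text": "My policy number is KV-12345.",
    "entities": [
        {"start": 20, "end": 28, "label": "POLICY_NUMBER"},
    ],
}

# Sanity check: the offsets must exactly cover the annotated value.
ent = record["entities"][0]
span = record["text"][ent["start"]:ent["end"]]
assert span == "KV-12345"

line = json.dumps(record)  # one line of a JSONL training file
```

Checking span offsets like this at annotation time catches off-by-one errors before they silently degrade training quality.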

3. Active learning

The model obtained in this way is then applied to unannotated raw data. From the model's predictions, a set of 30 to 50 documents is selected, then checked and, if necessary, corrected by data scientists or business experts.

The selection is based on two criteria: where is the model uncertain (“uncertainty sampling”), and which documents differ most from the previous training data (“diversity sampling”)? This ensures that the selected documents provide maximum learning effect when the model is retrained - so-called “active learning”.
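The uncertainty criterion can be sketched with prediction entropy: documents whose class probabilities are spread out (high entropy) are annotated first. The pool entries and score format below are illustrative assumptions, not a specific product API:

```python
import math

def entropy(probs):
    """Prediction entropy: higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Each pool entry: (document id, class probabilities from the current model).
# Values are invented for illustration.
pool = [
    ("doc-01", [0.98, 0.01, 0.01]),  # confident prediction
    ("doc-02", [0.40, 0.35, 0.25]),  # very uncertain prediction
    ("doc-03", [0.55, 0.30, 0.15]),
]

# Uncertainty sampling: pick the documents with the highest entropy
# for the next round of manual annotation.
batch = sorted(pool, key=lambda d: entropy(d[1]), reverse=True)[:2]
```

Diversity sampling would add a second score, e.g. distance of a document's embedding from the existing training set, so that the batch also covers unfamiliar document types rather than only ambiguous ones.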

In a few short iteration loops, the model is brought to optimal prediction quality - a process that otherwise often takes many times as long.

4. Deployment

The model obtained in this way is usually ready for use in production after a few weeks. Modern training solutions offer automatic deployment in a standardized format and often already provide connectors for commercially available IT systems. The model can be integrated into backend systems, workflow tools and even front ends via a standardized API.

5. Continuous improvement

During operation, the model is validated with a so-called “human in the loop”: predictions whose confidence falls below a predefined threshold are passed to a human reviewer for verification in order to avoid errors.

The corrections made by the “human in the loop” are then used to fine-tune the model, creating a continuous improvement process.
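The routing logic of such a loop can be sketched in a few lines. The threshold value and record shapes below are illustrative assumptions:

```python
# Confidence-based routing for human-in-the-loop review.
# Threshold and prediction records are invented for illustration.
CONFIDENCE_THRESHOLD = 0.90

def route(prediction):
    """Send low-confidence predictions to a human reviewer."""
    if prediction["confidence"] < CONFIDENCE_THRESHOLD:
        return "human_review"
    return "auto_process"

review_queue = []  # corrections collected here later feed fine-tuning

predictions = [
    {"doc": "claim-17", "label": "invoice", "confidence": 0.97},
    {"doc": "claim-18", "label": "cancellation", "confidence": 0.62},
]

for p in predictions:
    if route(p) == "human_review":
        review_queue.append(p)
```

The threshold is a business decision: lowering it automates more documents but lets more errors through, while raising it shifts work back to the reviewers.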


The AI system for intelligent document processing

Fully integrated into your work process, the AI software kinisto can support employees: work steps on a document that were previously manual are performed automatically by the AI. As specialists in applied AI, we are happy to advise you on the topic and on your specific use case!
