ChatGPT and document processing
By now, (almost) everyone knows a Large Language Model by name: the generative language model ChatGPT is familiar to the general public. It can answer questions and generate texts - but can it also be used for document processing in the enterprise, or are classic transformer models (like BERT) better suited? We answer the question: what can models of the GPT family do when it comes to information extraction and processing - and what can they not do (yet)?
Natural Language Processing (NLP)
Natural Language Processing (NLP) is a subfield of artificial intelligence. It involves developing AI software that analyzes (reads) and produces language at a human level. The technology is used in a variety of applications - for example, chatbots, speech recognition, machine translation and sentiment analysis. And in particular, automated information extraction from documents using NLP technology is a great lever in the digitization and automation of business processes. At the heart of every NLP solution are AI models.
Large Language Models are language models that have been trained to process and/or generate natural language. Deep learning techniques such as neural networks are used to train the models on (very large numbers of) existing texts.
The models relate elements of a text - the text modules - to each other and analyze them in context. The relationships between the data/elements thus serve as a basis for “understanding” their meaning in context.
Generative & classic transformer models
The models used in Natural Language Processing are all language models, some of which have been trained on very large amounts of text ("Large" Language Models). Depending on the model, however, the way it works and its primary area of application differ.
Generative models (e.g. GPT-4) are trained to generate texts. The model calculates a probability distribution over possible next words in a sentence or text sequence. Generative models are therefore particularly well suited to producing texts that resemble the texts in the training set.
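The "probability distribution over next words" can be illustrated with a toy softmax over raw model scores. The vocabulary and logit values below are made up for illustration - a real model would compute them over its entire vocabulary:

```python
import math

def softmax(logits):
    """Turn raw model scores (logits) into a probability distribution."""
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate continuations for "The invoice is due on ..."
vocab = ["Monday", "Friday", "arrival", "receipt"]
probs = dict(zip(vocab, softmax([1.2, 2.5, 0.3, 2.1])))

next_word = max(probs, key=probs.get)      # greedy decoding: pick the most likely word
```

Generation then repeats this step: the chosen word is appended to the sequence and the model scores the next position.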
Classic transformer models (e.g. BERT) are designed and trained to perform concrete language tasks. The model learns "context tasks" - that is, to understand the meaning of words and sentences in a particular context. Such a model needs less training data to work well and can be adapted more easily to specific questions.
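A concrete language task of this kind is token classification: a BERT-style model assigns a label to every token. The sketch below mimics only the *output shape* of such a model - the regex rules stand in for the learned weights and are purely illustrative:

```python
import re

# Stand-in "model": rules instead of learned weights, for illustration only.
LABELS = {
    "DATE": re.compile(r"\d{2}\.\d{2}\.\d{4}"),
    "AMOUNT": re.compile(r"\d+[.,]\d{2}"),
}

def classify_tokens(text):
    """Return (token, label) pairs; 'O' marks tokens outside any entity."""
    tagged = []
    for token in text.split():
        label = "O"
        for name, pattern in LABELS.items():
            if pattern.fullmatch(token.strip(",.")):
                label = name
                break
        tagged.append((token, label))
    return tagged

tags = classify_tokens("Invoice dated 01.03.2023 over 250.00 EUR")
```

The per-token labels are exactly what makes these models suitable for extraction: every recognized entity is tied to a specific token in the input.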
To solve concrete tasks with NLP solutions in practice, the models must be integrated into processes and applications, connected to the surrounding systems and (in almost all cases) trained again for the specific task. Only then is performance in document processing good enough to make the effort really worthwhile.
GPT stands for Generative Pretrained Transformer. The GPT models developed by OpenAI (GPT-1 through GPT-4) are therefore generative large language models - among the most advanced in the world.
The main applications of GPT are powering chatbots (such as ChatGPT) as well as text generation, translation and summarization.
In addition, GPT can also be used to classify texts or divide them into certain categories and to extract information from texts.
ChatGPT for process optimization on the document
The hype around ChatGPT leads many decision-makers in companies to ask a legitimate question: can it also be used for document processing?
The answer is: it depends.
The use of generative models - like GPT-4 - can be very useful in certain (sub)areas. However, there are many cases where the use of a specially adapted classical Transformer model makes more sense.
Potential in the use of GPT-4
Training data creation
With the help of GPT-4, training data can be generated automatically. This can be a great asset, because a large amount of high-quality training data is essential for the performance of any NLP model. For example, GPT models can generate paraphrased sentences or texts that are similar to existing texts, increasing diversity through synthetic training data. GPT models can also support text classification by identifying relevant keywords that can serve as labels for classification models.
In addition, the GPT models can be used to support data cleansing as well as augmentation. Thus, you can correct existing data (quality improvement) or generate completely new data sets - which in turn can increase the variance.
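In practice a GPT model would produce the paraphrases via an API call; the local sketch below uses a tiny synonym table as a stand-in for the model, only to show the shape of the augmentation step (the words and synonyms are made up):

```python
import random

# Hypothetical synonym table - in a real pipeline a GPT model would
# generate the paraphrases instead.
SYNONYMS = {"invoice": ["bill"], "due": ["payable"], "amount": ["sum", "total"]}

def augment(sentence, seed=0):
    """Create one paraphrased variant by swapping known words for synonyms."""
    rng = random.Random(seed)              # seeded, so each variant is reproducible
    words = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        words.append(rng.choice(options) if options else word)
    return " ".join(words)

variant = augment("The invoice is due", seed=0)
```

Each synthetic variant keeps the original label of the sentence, so the augmented pairs can be fed directly into the training set of a classification or extraction model.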
Extraction of simple data from short documents
GPT models are trained on a very large amount of text and can recognize information in context. Nevertheless, reliably extracting and evaluating information even from complex texts with generative models can be very difficult and time-consuming. This is because the models always need a "prompt" - a concrete query or question - and defining this prompt so that the model works accurately and really delivers high-quality results is often hard.
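What such a prompt might look like for structured extraction can be sketched as a template. The function and field names below are illustrative, not part of any official API; constraining the model to JSON-only output is one common way to make the answer easier to parse, though it does not guarantee accuracy:

```python
import json

def build_extraction_prompt(document_text, fields):
    """Assemble a prompt asking a generative model for JSON-only output.

    Illustrative template - field names and wording are assumptions,
    and would need tuning for a real model.
    """
    schema = {field: "..." for field in fields}
    return (
        "Extract the following fields from the document and answer "
        "with JSON only, using null for missing values.\n"
        f"Fields: {json.dumps(schema)}\n"
        f"Document:\n{document_text}"
    )

prompt = build_extraction_prompt("Invoice No. 4711, due 01.03.2023",
                                 ["invoice_number", "due_date"])
```

Even with a careful template like this, the model's answer still has to be validated - nothing forces a generative model to actually return well-formed JSON or correct values.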
In addition, the computational effort is very high. This results mainly from the number of model parameters - every model call involves all of them in the calculation.
- BERT: 110 million parameters
- GPT-3: 175 billion parameters (roughly 1,600 times as many!)
- GPT-4: not published, but probably again significantly more than GPT-3
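The gap between the published parameter counts can be checked directly (the BERT figure refers to BERT-base):

```python
bert_params = 110_000_000         # BERT-base: ~110 million parameters
gpt3_params = 175_000_000_000     # GPT-3: ~175 billion parameters

# Every model call touches all parameters, so the per-call compute cost
# scales roughly with this ratio.
ratio = gpt3_params / bert_params
```

Actual inference cost also depends on sequence length, hardware and batching, but the parameter ratio already indicates the order of magnitude of the difference.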
When are classic transformer models the better choice?
If confidence needs to be high and the extraction results accurate, classic transformer models (e.g. BERT) are often the more sensible choice, because they are actually designed to identify entities and classify texts. Generative models, on the other hand, are designed to produce new texts and can therefore tend to "hallucinate": they produce less precise results and more "uncontrolled" output. Training generative language models for specific cases requires a significant amount of training data and resources, and is therefore rarely practical.
If the extraction results have to be verifiable by humans in direct context, classic transformer models are the right choice. Generative models do not return the position of the found elements in the text, which makes it very time-consuming or even impossible for employees to compare the extracted data quickly and directly with the document (e.g. a scan of a contract).
Generative models do not (yet) recognize page position information when extracting data from documents. Thus, they are not designed to analyze specific layout information or trained to extract structured data from documents. They can recognize information, but have no knowledge of the exact position of these texts within the documents.
However, there are classic transformer models that can process layout information in addition to text (additional model input) and thus recognize and extract information such as table columns, field positions, or paragraph structures.
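The verifiability argument comes down to provenance: an extraction result that carries character offsets (and, for layout-aware models, page and position information) can be checked mechanically against the source. A minimal sketch of such a result record - the class and field names are assumptions, not a specific product's data model:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    """One extracted value together with the provenance a reviewer needs."""
    field: str
    value: str
    start: int   # character offset in the source text
    end: int
    page: int    # page number; layout-aware models could add a bounding box

text = "Invoice No. 4711, due 01.03.2023"
value = "4711"
start = text.index(value)
hit = Extraction("invoice_number", value, start, start + len(value), page=1)

# A reviewer (or an automated check) can verify the value in place:
assert text[hit.start:hit.end] == hit.value
```

A generative model that returns only the value string offers no such check - the reviewer has to search the document manually.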
Many or long documents
Classic transformer models process texts much faster than generative models, because they get by with far fewer parameters. If, for example, a large stock of documents is to be checked for specific criteria or long documents are to be processed in a structured manner, the use of generative models is not recommended - if only for reasons of efficiency.
If the data is to remain on-premise (or must remain for data protection reasons), i.e. cannot be transferred to a cloud environment, classic transformer models are the best choice. They work quickly, require fewer parameters and are therefore more resource-efficient.
Currently, few generative models are even freely available. GPT models, for example, can only be accessed via OpenAI’s API.
The technology behind any practically viable NLP solution should be designed to meet the specific requirements.
Structured extraction of information from documents is possible with classic transformer models as well as with generative models - however, generative models such as GPT-4 are not better suited in every case. If high accuracy is required or if people need to review the results in a structured way, their use is not (yet) recommended. Generative models also still reach their limits when data has to be extracted layout-dependently or when many long documents have to be analyzed. Their use on-premise is hardly possible so far - or makes little sense.
However, the use of models such as GPT-4 can already be worthwhile, especially for the extraction of simple information from short texts.
Generative models are particularly well suited for optimizing training data - which in turn can be used to efficiently train classical models.
The technology is developing rapidly - generative models perform impressively. In the next few years, they will also play an increasingly important role in document processing.