AI Large Model Topic: DocLLM, a Large Language Model for Multimodal Document Understanding

Mondo Technology Updated on 2024-02-01

What is shared today is the AI Large Model Series in-depth research report: AI Large Model Topic: DocLLM, a Large Language Model for Multimodal Document Understanding.

Report Producer: Zheshang**.

Featured Report**: The School of Artificial Intelligence

Enterprise-related documents, such as invoices, receipts, contracts, orders, and **, form an important part of the enterprise corpus. These documents often have complex layouts and custom typography, showing great diversity in template, format, and quality. While document artificial intelligence (DocAI) has made great strides in tasks such as extraction, classification, and question answering, significant performance gaps remain in real-world applications in terms of accuracy, reliability, contextual understanding, and the ability to generalize to unseen domains. Recently, the JPMorgan AI Research team (Dongsheng Wang et al.) developed DocLLM. The model places special emphasis on spatial structure and avoids the use of complex image encoders. Its architecture incorporates a disentangled spatial attention mechanism and a distinctive pre-training strategy built around infilling text segments. DocLLM has demonstrated performance superior to state-of-the-art language models when handling the irregular layouts and diverse content common in enterprise documents.
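To make the disentangled spatial attention idea concrete, the sketch below illustrates how attention scores can combine separate text and bounding-box (spatial) projections through cross terms. This is a minimal, hypothetical PyTorch illustration of the concept, not the authors' implementation; the layer names, dimensions, and scalar weights `lambda_*` are assumptions.

```python
# Minimal sketch of disentangled spatial attention: text and spatial features
# get their own query/key projections, and the score mixes four terms
# (text-text, text-box, box-text, box-box). Illustrative only.
import torch
import torch.nn as nn


class DisentangledSpatialAttention(nn.Module):
    def __init__(self, d_model: int, lambda_ts=1.0, lambda_st=1.0, lambda_ss=1.0):
        super().__init__()
        # Projections for the text stream.
        self.q_text = nn.Linear(d_model, d_model)
        self.k_text = nn.Linear(d_model, d_model)
        self.v_text = nn.Linear(d_model, d_model)
        # Separate projections for the spatial (bounding-box) stream.
        self.q_box = nn.Linear(d_model, d_model)
        self.k_box = nn.Linear(d_model, d_model)
        self.lambda_ts = lambda_ts  # weight for text-query / box-key term
        self.lambda_st = lambda_st  # weight for box-query / text-key term
        self.lambda_ss = lambda_ss  # weight for box-query / box-key term

    def forward(self, text_h, box_h, causal_mask):
        # text_h, box_h: (batch, seq, d_model); causal_mask: (seq, seq) of 0/1
        qt, kt, v = self.q_text(text_h), self.k_text(text_h), self.v_text(text_h)
        qs, ks = self.q_box(box_h), self.k_box(box_h)
        d = qt.size(-1)
        scores = (qt @ kt.transpose(-2, -1)
                  + self.lambda_ts * (qt @ ks.transpose(-2, -1))
                  + self.lambda_st * (qs @ kt.transpose(-2, -1))
                  + self.lambda_ss * (qs @ ks.transpose(-2, -1))) / d ** 0.5
        scores = scores.masked_fill(causal_mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        # In this sketch, values come from the text stream only.
        return attn @ v
```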

Single-modal models cannot handle multimodal input, while existing multimodal models rely on complex visual encoders. To address these problems, DocLLM, proposed by JPMorgan's AI Research team, is a lightweight extension of a standard large language model (LLM) focused on understanding visually rich documents. Unlike traditional LLMs, DocLLM models both spatial layout and text semantics; as a result, it is multimodal in nature. Spatial layout information is incorporated through the bounding-box coordinates of the text tokens, which are typically obtained using optical character recognition (OCR), and does not depend on any visual encoder component. As a result, the solution retains the causal decoder architecture, introduces only a small increase in model size, and reduces processing time by not relying on a complex visual encoder.
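As a rough illustration of the input pipeline described above, the hypothetical sketch below turns OCR output into the two streams the model consumes: token ids for the text modality and normalized bounding-box coordinates for the spatial modality. Function and field names are illustrative assumptions and do not come from the DocLLM release.

```python
# Hypothetical preprocessing: OCR tokens plus pixel bounding boxes become
# vocabulary ids and resolution-independent box coordinates in [0, 1].
from typing import List, Tuple


def prepare_inputs(ocr_tokens: List[Tuple[str, Tuple[int, int, int, int]]],
                   vocab: dict,
                   page_width: int,
                   page_height: int):
    """ocr_tokens: list of (token_text, (x0, y0, x1, y1)) pairs from an OCR engine."""
    token_ids, boxes = [], []
    for text, (x0, y0, x1, y1) in ocr_tokens:
        token_ids.append(vocab.get(text, vocab["<unk>"]))
        # Normalize so layout information does not depend on page resolution.
        boxes.append((x0 / page_width, y0 / page_height,
                      x1 / page_width, y1 / page_height))
    return token_ids, boxes


# Example: two tokens from a 1000x800 page.
vocab = {"<unk>": 0, "Invoice": 1, "Total": 2}
ids, boxes = prepare_inputs([("Invoice", (40, 30, 180, 60)),
                             ("Total", (40, 700, 120, 730))],
                            vocab, page_width=1000, page_height=800)
print(ids, boxes)
```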

The JPMorgan AI Research team provides DocLLM in two versions, 1B and 7B: the 1B version is based on Falcon-1B and the 7B version on Llama2-7B. The pre-trained parameters of the two base models serve as the text-modality backbone, to which the disentangled attention mechanism and the text-block infilling objective are added. With 16-bit mixed precision, the 7B version was trained on eight 24GB A10G GPUs, while the 1B version used only a single GPU of the same specification.
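For readers unfamiliar with 16-bit mixed-precision training, the following is a minimal, hypothetical PyTorch sketch of a single mixed-precision training step. The placeholder model, optimizer, and loss function are assumptions for illustration, not the published training recipe.

```python
# Sketch of a 16-bit mixed-precision training step with torch.cuda.amp.
import torch

model = torch.nn.Linear(512, 512).cuda()          # stand-in for the real backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # rescales gradients for fp16


def training_step(batch, targets):
    optimizer.zero_grad()
    # Forward pass runs in 16-bit precision to save memory and time.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(batch), targets)
    scaler.scale(loss).backward()   # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```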

Not only does DocLLM excel at understanding visually rich documents, it also has the potential to change generative pre-training. It enables language models to move beyond the traditional paradigm of plain text and next-token prediction to accommodate complex layout structures. This means that documents with rich layouts, such as e-books and e-publications, can be included directly in the pre-training corpus without extensive pre-processing. Similarly, DocLLM is aware of page breaks and document boundaries, which enhances its ability to understand documents of different lengths. This capability addresses the limitations of earlier small multimodal models (mainly used for single-page documents) and existing large multimodal language models (mainly designed for images). Overall, the innovative application of DocLLM to visually rich documents points to promising directions for future research and improvement.

