Document AI, or Document Intelligence, is a relatively new research topic that refers to the techniques for automatically reading, understanding, and analyzing business documents. It is an important research direction for natural language processing and computer vision. In recent years, the popularity of deep learning technology has greatly advanced the development of Document AI, such as document layout analysis, visual information extraction, document visual question answering, document image classification, etc. This paper briefly reviews some of the representative models, tasks, and benchmark datasets. Furthermore, we also introduce early-stage heuristic rule-based document analysis, statistical machine learning algorithms, and deep learning approaches, especially pre-training methods. Finally, we look into future directions for Document AI research.
5.1.3 DOCUMENT IMAGE CLASSIFICATION
Document image classification is the task of assigning document images to categories, which is essential for business digitalization. RVL-CDIP (Harley et al., 2015) is a representative dataset for this task. The dataset contains 400,000 grayscale images in 16 document image categories. Tobacco-3482 (Kumar et al., 2014) selects a subset of RVL-CDIP for evaluation, which contains 3,482 grayscale document images.
Document image classification is a special subtask of image classification, so classification models designed for natural images can also be applied to document images. Afzal et al. (2015) introduce a CNN-based method for document image classification. To overcome the problem of insufficient samples, they initialize the model with AlexNet weights trained on ImageNet and then adapt it to document images. Afzal et al. (2017) apply GoogLeNet, VGG, ResNet, and other models that were successful on natural images to document images through transfer learning. Through adjustment of model parameters and data processing, Tensmeyer & Martinez (2017) train a CNN that outperforms previous models without transfer learning from natural images. Das et al. (2018) propose a new convolutional network based on different image regions for document image classification. This method classifies different regions of the document separately, and finally merges multiple cl
Sarkhel & Nandi (2019) extract features at different levels by introducing a pyramidal multi-scale structure. Dauphinee et al. (2019) obtain the text of the document by performing OCR on the document image, and combine image and text features to further improve the classification performance.
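The idea of combining image and text features can be illustrated with a minimal sketch. It assumes the simplest possible fusion: the two feature vectors are concatenated and passed through a single linear layer followed by a softmax. This is not necessarily the architecture used by Dauphinee et al. (2019); the function names (`softmax`, `fuse_and_classify`), the feature values, and the weights are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(
    image_features: Sequence[float],
    text_features: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """Concatenate both feature vectors, then apply one linear layer and softmax.

    Assumption: fusion by concatenation; real systems may fuse differently.
    """
    fused = list(image_features) + list(text_features)
    if len(weights) != len(biases):
        raise ValueError("need exactly one bias per class")
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("each weight row must match the fused feature length")
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Toy inputs (hypothetical): 2 visual features, 3 textual features, 2 classes.
image_features = [0.2, 0.9]
text_features = [1.0, 0.0, 0.5]
weights = [
    [1.0, 0.0, 2.0, 0.0, 0.0],  # class 0 relies mostly on the first text feature
    [0.0, 1.0, 0.0, 2.0, 0.0],  # class 1 relies mostly on the second text feature
]
biases = [0.0, 0.0]

probabilities = fuse_and_classify(image_features, text_features, weights, biases)
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
```

In practice the image features would come from a CNN and the text features from an encoder applied to the OCR output, with the weights learned from labeled documents rather than set by hand.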
5.1.4 DOCUMENT VISUAL QUESTION ANSWERING
Document Visual Question Answering (VQA) is a high-level understanding task for document images. Specifically, given a document image and a related question, the model needs to give the correct answer to the question based on the given image. A specific example is shown in Figure 5. VQA for documents first appears in the DocVQA dataset (Mathew et al., 2021b), which contains more than 12,000 document images and 50,000 corresponding questions. Later, InfographicVQA (Mathew et al., 2021a) is also proposed, which is a VQA benchmark for infographic images in documents. As the answers in DocVQA are relatively short and the topics are not diverse, some researchers also proposed the VisualMRC (Tanaka et al., 2021) dataset for the document VQA task, which includes long answers with diverse topics.
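To make the notion of a "correct answer" concrete, the sketch below scores predicted answers against reference answers. This section does not name an evaluation metric; the sketch assumes Average Normalized Levenshtein Similarity (ANLS), which is commonly reported on DocVQA, with a threshold of 0.5 and lowercased, stripped strings. The function names and the example questions' answers are hypothetical.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def answer_similarity(prediction: str, reference: str, threshold: float = 0.5) -> float:
    """Return 1 - normalized edit distance, or 0 when the distance is too large."""
    pred = prediction.strip().lower()
    ref = reference.strip().lower()
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    normalized_distance = levenshtein(pred, ref) / longest
    return 1.0 - normalized_distance if normalized_distance < threshold else 0.0


def anls(predictions: Sequence[str], references: Sequence[Sequence[str]]) -> float:
    """Average, over questions, of the best similarity to any reference answer."""
    if len(predictions) != len(references):
        raise ValueError("need one list of reference answers per prediction")
    if not predictions:
        return 0.0
    per_question = [
        max((answer_similarity(pred, ref) for ref in refs), default=0.0)
        for pred, refs in zip(predictions, references)
    ]
    return sum(per_question) / len(per_question)


# Hypothetical model outputs and reference answers for three questions.
predictions = ["12 March 2019", "Acme Corp", "blue"]
references: List[List[str]] = [
    ["12 March 2019"],
    ["ACME Corp.", "Acme Corporation"],
    ["red"],
]
score = anls(predictions, references)
```

A soft string match of this kind tolerates small OCR or formatting differences in short extracted answers, while the threshold still gives no credit to answers that are mostly wrong.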