Multimodal Large Language Models for Scientific Document Understanding and Structured Knowledge Extraction

David Park; Mei Zhou; Sophie Laurent

doi:10.55001/faids.v1i2.55

Abstract

Scientific literature contains rich knowledge in text, figures, tables, and equations, yet existing information extraction systems process these modalities in isolation. We introduce SciMMLLM, a multimodal large language model fine-tuned on 2.8 million scientific documents spanning 12 disciplines using a novel cross-modal alignment pre-training objective. SciMMLLM jointly encodes document text, embedded figures, and structured tables through a unified transformer architecture with modality-specific adapters. On the SciERC entity-relation extraction benchmark, SciMMLLM achieves an F1 of 78.4% (+6.2% over text-only LLMs). For figure caption generation and table-to-text conversion on PubMed Central, it reaches BLEU-4 scores of 42.7 and 38.9, respectively. Applied to systematic review automation, SciMMLLM reduces manual screening time by 73% while maintaining 96.2% sensitivity in identifying relevant papers.
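As an illustration of the metrics reported above, the following is a minimal sketch of how F1 over extracted relation triples and screening sensitivity are commonly computed. The function names and the toy data are hypothetical and are not taken from the SciMMLLM evaluation code; the exact scoring protocol used for SciERC and for the systematic review experiments may differ.

```python
from typing import Set, Tuple

# A relation is represented as (head entity, relation label, tail entity).
Triple = Tuple[str, str, str]


def precision_recall_f1(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between predicted and gold triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def sensitivity(flagged: Set[str], relevant: Set[str]) -> float:
    """Fraction of truly relevant papers that the screening step flagged (recall)."""
    if not relevant:
        return 0.0
    return len(flagged & relevant) / len(relevant)


# Toy data (assumed for illustration only).
gold_triples = {
    ("BERT", "USED-FOR", "entity recognition"),
    ("CNN", "USED-FOR", "image classification"),
    ("BLEU", "EVALUATE-FOR", "translation"),
    ("adapter", "PART-OF", "transformer"),
}
predicted_triples = {
    ("BERT", "USED-FOR", "entity recognition"),
    ("CNN", "USED-FOR", "image classification"),
    ("BLEU", "EVALUATE-FOR", "translation"),
    ("adapter", "USED-FOR", "transformer"),  # wrong relation label
}

p, r, f1 = precision_recall_f1(predicted_triples, gold_triples)
screening_sensitivity = sensitivity({"p1", "p2", "p3", "p9"}, {"p1", "p2", "p3", "p4"})
```

Under this reading, the sensitivity figure measures how few relevant papers the automated screening step misses, which is why it is reported alongside the reduction in manual screening time.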

About the Authors

David Park, Allen Institute for AI, Seattle, WA 98103, USA

David Park is an assistant professor at the Allen Institute for AI, Seattle, WA 98103, USA. Their research focuses on social sciences, with over 26 publications in peer-reviewed journals.

Mei Zhou, National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing 100190, China

Mei Zhou is a professor at the National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing 100190, China. Their research focuses on computational science, with over 36 publications in peer-reviewed journals.

Sophie Laurent, Inria Centre de Paris, 75013 Paris, France

Sophie Laurent is a professor at the Inria Centre de Paris, 75013 Paris, France. Their research focuses on advanced materials, with over 62 publications in peer-reviewed journals.