Research Paper
Multimodal Large Language Models for Scientific Document Understanding and Structured Knowledge Extraction
Abstract
Scientific literature contains rich knowledge in text, figures, tables, and equations, yet existing information extraction systems process these modalities in isolation. We introduce SciMMLLM, a multimodal large language model fine-tuned on 2.8 million scientific documents spanning 12 disciplines using a novel cross-modal alignment pre-training objective. SciMMLLM jointly encodes document text, embedded figures, and structured tables through a unified transformer architecture with modality-specific adapters. On the SciERC entity-relation extraction benchmark, SciMMLLM achieves an F1 of 78.4% (+6.2% over text-only LLMs). For figure caption generation and table-to-text conversion on PubMed Central, it reaches BLEU-4 scores of 42.7 and 38.9, respectively. Applied to systematic review automation, SciMMLLM reduces manual screening time by 73% while maintaining 96.2% sensitivity for identifying relevant papers.
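For readers less familiar with the reported metrics, the following is a minimal sketch of how F1 and sensitivity are conventionally computed from prediction counts. It is not the evaluation code used for SciMMLLM; the function names and the example counts are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts. Each value is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity (equivalently, recall): the fraction of truly relevant
    items that were identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    # Hypothetical counts for illustration only; these are not results from the paper.
    p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
    print(f"sensitivity={sensitivity(tp=24, fn=1):.3f}")
```

In a screening setting, sensitivity is the quantity of interest because a missed relevant paper (a false negative) is usually costlier than an extra paper passed on for manual review.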