Seeing Beyond Hallucinations: LLM-based Compositional Information Extraction for Multimodal Reasoning
Published in SIGIR, 2025
Advancements in Multimodal Large Language Models (MLLMs) have significantly improved information extraction and retrieval performance. Despite these achievements, MLLMs still suffer from the visual object hallucination problem, where models produce plausible yet incorrect or irrelevant content not present in the input data. This issue arises from an over-reliance on "bag-of-objects" representations and language priors, leading to inadequate extraction of visual objects along with their attributes and relationships. Existing methods to mitigate these hallucinations are limited by the significant human labor they require and their coarse-grained nature. To overcome these challenges, we introduce Multimodal Contrastive Decoding (MMCD), a novel decoding approach that integrates graph-structured reasoning paths with contrastive decoding. MMCD mitigates object hallucinations induced by language priors and …
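To make the contrastive-decoding component concrete, here is a minimal sketch of the general idea: next-token scores from an "expert" model are contrasted against those of an "amateur" model (one driven by language priors), so tokens favored mainly by the prior are penalized. All names and values below are illustrative assumptions, not MMCD's actual formulation, which additionally incorporates graph-structured reasoning paths.

```python
import numpy as np

def contrastive_decode(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    """Pick the next token by contrasting expert and amateur logits.

    alpha scales the penalty for amateur-favored tokens; beta is an
    adaptive plausibility cutoff: tokens whose expert probability falls
    below beta times the expert's top probability are masked out.
    (Hypothetical parameterization for illustration only.)
    """
    # Softmax over expert logits (shifted for numerical stability).
    expert_probs = np.exp(expert_logits - expert_logits.max())
    expert_probs /= expert_probs.sum()

    # Keep only tokens the expert itself finds plausible.
    mask = expert_probs >= beta * expert_probs.max()

    # Reward the expert, penalize what the prior-driven amateur prefers.
    scores = (1 + alpha) * expert_logits - alpha * amateur_logits
    scores = np.where(mask, scores, -np.inf)
    return int(np.argmax(scores))

# Toy vocabulary of 3 tokens: the amateur (language prior) strongly
# favors token 1, and the expert only marginally prefers it.
expert = np.array([1.9, 2.0, -1.0])
amateur = np.array([0.0, 3.0, 0.0])
print(contrastive_decode(expert, amateur))  # contrast selects token 0
```

In the toy example, greedy decoding on the expert alone would emit token 1 (its highest logit), but the contrast term discounts it because the amateur also strongly prefers it, so token 0 wins instead; token 2 is masked by the plausibility constraint.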