One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA
Abstract
External memory effectively grounds question answering (QA) based on large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms represent each memory item in raw text and image form, so retrieval-based systems must pass the retrie…