Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction
Summary
- In this paper, the authors propose a simple but effective way to prompt a VLM for reference-free image caption evaluation (sketched in the code below)
 - First, given the image, the VLM extracts visual context: the objects in the image, their attributes, and the relationships between them
 - Second, given the image, the candidate caption, and the extracted visual context, the VLM assigns a score from 0 to 100 indicating how well the caption describes the image
 - The authors show that using GPT-4 as the VLM backbone can beat the SOTA score
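A minimal sketch of this two-step prompting flow, assuming a generic `query_vlm(image_path, prompt)` helper in place of a concrete VLM API; the prompt wording below is illustrative, not the paper's exact prompts:

```python
def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder: send an image plus a text prompt to a VLM and return its reply."""
    raise NotImplementedError("plug in your VLM client here")

def extract_visual_context(image_path: str) -> str:
    # Step 1: ask the VLM for objects, their attributes (color, shape, size),
    # and the relationships between objects.
    prompt = (
        "List the objects in the image, their attributes "
        "(color, shape, size), and the relationships between them."
    )
    return query_vlm(image_path, prompt)

def score_caption(image_path: str, caption: str, visual_context: str) -> str:
    # Step 2: rate the candidate caption from 0 to 100, conditioned on both
    # the image and the extracted visual context.
    prompt = (
        f"Visual context:\n{visual_context}\n\n"
        f"Candidate caption: {caption}\n"
        "On a scale of 0 to 100, how accurately does the caption "
        "describe the image? Answer with a number."
    )
    return query_vlm(image_path, prompt)

def evaluate(image_path: str, caption: str) -> str:
    context = extract_visual_context(image_path)
    return score_caption(image_path, caption, context)
```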
 

Annotations
Files/VisionLanguageModelbased_2024/image-3-x62-y538.png
« We define visual context as the content of an image classified into objects, object attributes (including color, shape, and size), and relationships between objects following the similar notion of scene graphs »(3)
« We eliminated these canonical phrases in the postprocessing phase and designated the first integer value in the output sentences as the evaluation score »(3)
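A hedged sketch of that post-processing step, taking the first integer in the VLM's output as the evaluation score (function name and the handling of missing scores are assumptions):

```python
import re

def parse_score(vlm_output: str):
    """Return the first integer appearing in the VLM's reply, or None if absent."""
    match = re.search(r"\d+", vlm_output)
    return int(match.group()) if match else None

# e.g. parse_score("Score: 85. The caption matches the image well.") -> 85
```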
« In contrast, the other metrics, including our VisCE2, do not depend on such data-specific fine-tuning. »(4)
« Pearson correlation for measuring the linear correlation between two sets of data, Kendall’s Tau for measuring the ordinal association between two measured quantities and the accuracy as of the percentage of the correct pairwise ranking between two candidate captions »(4)
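A small illustration of these meta-evaluation measures using SciPy; the scores below are made-up placeholders, not results from the paper:

```python
from scipy.stats import pearsonr, kendalltau

metric_scores = [72, 40, 88, 55]   # scores produced by the evaluation metric
human_scores  = [70, 35, 90, 60]   # corresponding human judgments

pearson_r, _ = pearsonr(metric_scores, human_scores)      # linear correlation
kendall_tau, _ = kendalltau(metric_scores, human_scores)  # ordinal association

# Pairwise ranking accuracy: fraction of caption pairs where the metric
# prefers the same caption as the human annotators.
# Each tuple: (metric score A, metric score B, human score A, human score B)
caption_pairs = [(72, 40, 70, 35), (55, 88, 60, 90)]
correct = sum(1 for m_a, m_b, h_a, h_b in caption_pairs if (m_a > m_b) == (h_a > h_b))
pairwise_accuracy = correct / len(caption_pairs)
```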
Date : 02-28-2024
Authors : Koki Maeda, Shuhei Kurita, Taiki Miyanishi, Naoaki Okazaki
Paper Link : http://arxiv.org/abs/2402.17969
Zotero Link: Preprint PDF
Tags : #/unread
Citation : @article{Maeda_Kurita_Miyanishi_Okazaki_2024, title={Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction}, author={Maeda, Koki and Kurita, Shuhei and Miyanishi, Taiki and Okazaki, Naoaki}, year={2024}, url={http://arxiv.org/abs/2402.17969}, DOI={10.48550/arXiv.2402.17969}}