Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction

Summary

Annotations

Annotation

« Our method focuses on visual context, which refers to the detailed content of images, including objects, attributes, and relationships »()

Annotation

« In this line of evaluation metrics, InfoMetIC (Hu et al., 2023) relies on the matching of image regions and words in the captions. »()

Annotation

« In this context, “visual context” refers to information about objects, attributes, and their relationships, including those in the background and inconspicuous objects »()

Annotation

Files/VisionLanguageModelbased_2024/image-3-x62-y538.png

Annotation

« We define visual context as the content of an image classified into objects, object attributes (including color, shape, and size), and relationships between objects following the similar notion of scene graphs »(3)
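A minimal sketch (not the authors' code) of what this visual-context definition could look like as a data structure: objects, per-object attributes (color, shape, size), and subject–predicate–object relationships, following the scene-graph notion the quote references. All concrete values are illustrative assumptions.

```python
# Hypothetical visual-context record in the scene-graph style described above.
visual_context = {
    "objects": ["dog", "frisbee", "grass"],
    "attributes": {
        "dog": ["brown", "large"],      # color, size
        "frisbee": ["red", "round"],    # color, shape
    },
    # Relationships between objects as (subject, predicate, object) triples.
    "relationships": [
        ("dog", "catching", "frisbee"),
        ("dog", "on", "grass"),
    ],
}
```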

Annotation

« We eliminated these canonical phrases in the postprocessing phase and designated the first integer value in the output sentences as the evaluation score »(3)
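A small sketch of the postprocessing the quote describes, under the assumption that the VLM output is free text: strip leading boilerplate and take the first integer in the remaining sentence as the evaluation score. The function name and the canonical phrases are illustrative, not from the paper.

```python
import re

# Example boilerplate phrases a VLM might prepend (hypothetical).
CANONICAL_PHRASES = ["Sure,", "I would rate this caption", "The score is"]

def extract_score(output: str):
    """Remove canonical phrases, then return the first integer found,
    or None if the output contains no integer."""
    for phrase in CANONICAL_PHRASES:
        output = output.replace(phrase, "")
    match = re.search(r"\d+", output)
    return int(match.group()) if match else None
```

For example, `extract_score("The score is 4 out of 5.")` yields `4`.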

Annotation

« In contrast, the other metrics, including our VisCE2, do not depend on such data-specific fine-tuning. »(4)

Annotation

« Pearson correlation for measuring the linear correlation between two sets of data, Kendall’s Tau for measuring the ordinal association between two measured quantities, and accuracy as the percentage of correct pairwise rankings between two candidate captions »(4)


Related Notes