Contrastive Region Guidance: Improving Grounding
in Vision-Language Models without Training

UNC Chapel Hill

Abstract

Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a "visual prompt", where visual markers such as bounding boxes delineate key image regions; this approach has become popular due to the improvement it provides in tasks requiring region-level information. However, current VLMs that can incorporate visual guidance are either proprietary and expensive or require costly training on curated data that includes visual prompts. We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. CRG contrasts model outputs produced with and without visual prompts, factoring out biases revealed by the model when answering without the information required to produce a correct answer (i.e., the model's prior). CRG achieves substantial improvements in a wide variety of VL tasks: When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench, a collection of six diverse region-based tasks such as recognition, math, and object relationship reasoning. We also show CRG's applicability to spatial reasoning, where we obtain up to 10% improvement on the hardest setting of What'sUp, as well as to compositional generalization -- improving accuracy by 11.5% and 7.5% on two challenging splits from SugarCrepe -- and to image-text alignment for generated images, where we improve by up to 8.4 AUROC and 6.8 F1 points on SeeTRUE. For cases that do not have reference regions for the prompt, we also show that CRG allows us to re-rank regions proposed by an object detection model in referring expression comprehension and phrase grounding benchmarks like RefCOCO/RefCOCO+/RefCOCOg and Flickr30K Entities, with an average improvement of 3.2% in accuracy when multiple proposals are available. In our analysis, we explore alternative masking strategies for CRG, demonstrate how CRG impacts the model's probability over relevant text phrases, and evaluate the role of the region guidance strength, empirically validating CRG's design choices.

Method


We introduce Contrastive Region Guidance (CRG), a training-free visual grounding method that guides any VLM to focus on specific regions in an image. Given any vision-language model (e.g., LLaVA), we generate output text conditioned on the image and input text by:
(1) Blacking out the key regions in the input image to obtain the model's response without visual evidence from those regions.
(2) Factoring out this bias so that outputs that do not rely on visual information from the key regions are down-weighted (see the sketch after this list).
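A minimal sketch of both steps, assuming a HuggingFace-style VLM whose forward pass takes pixel_values and input_ids and returns per-token logits; the helper names, the (1 + alpha) / -alpha combination, and the default guidance strength alpha are illustrative rather than a faithful reproduction of the paper's exact formulation.

from PIL import ImageDraw
import torch

def black_out(image, boxes):
    # Step (1): paint solid black rectangles over the key regions,
    # leaving the rest of the image untouched.
    masked = image.copy()
    draw = ImageDraw.Draw(masked)
    for x1, y1, x2, y2 in boxes:
        draw.rectangle([x1, y1, x2, y2], fill="black")
    return masked

def crg_next_token_logits(model, pixel_values, masked_pixel_values, input_ids, alpha=1.0):
    # Step (2): contrast the logits obtained with and without the key regions,
    # down-weighting tokens the model would predict anyway (its prior).
    with torch.no_grad():
        logits_full = model(pixel_values=pixel_values, input_ids=input_ids).logits[:, -1, :]
        logits_masked = model(pixel_values=masked_pixel_values, input_ids=input_ids).logits[:, -1, :]
    return (1 + alpha) * logits_full - alpha * logits_masked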

On the right side of the first figure, we show applications of CRG to various VL tasks:
(a): When answering a visual question about a region of interest, CRG guides the VLM to answer about that specific region.
(b): Even when no regions are provided, we can leverage an object detector to find important objects and guide the VLM to focus on them.
(c): For image-text alignment, CRG guides the model toward text that describes the objects and relationships actually found in the image, yielding a higher probability for the correct text than for the incorrect text (a scoring sketch follows this list).
(d): CRG can also help VLMs find the region corresponding to a given text among multiple region proposals, by selecting the mask that provides the largest contrast.
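For sequence-scoring uses such as (c), the same contrast can be applied to the log-probability of an entire candidate text. A minimal sketch, assuming a hypothetical score_text(model, image, text) helper that returns the sum of token log-probabilities of the text conditioned on the image, plus the black_out helper sketched above:

def crg_text_score(model, image, boxes, text, alpha=1.0):
    # Text that the model finds likely even without the key regions (its
    # prior) is down-weighted; text supported by the regions is boosted.
    full = score_text(model, image, text)
    masked = score_text(model, black_out(image, boxes), text)
    return (1 + alpha) * full - alpha * masked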

Evaluation on Visual Prompt Following


We evaluate on ViP-Bench, which comprises 303 image-question pairs specifically designed to comprehensively evaluate visual prompt following capabilities, spanning six categories: Object Recognition (Rec), OCR, Knowledge (Know), Math, Object Relationship Reasoning (Rel), and Language Generation (Lang).

Takeaways:
1. CRG unlocks visual prompt following, matching fine-tuned models.
2. CRG can also help models fine-tuned with visual prompts.
3. CRG is more helpful to a stronger VLM backbone.
4. Unlike CRG, Set-of-Mark prompting (Yang et al., 2023) is not effective on LLaVA-based models.

Evaluation on Image-Text Alignment (Spatial Understanding, Compositionality, Generated Image Evaluation)


Takeaway: CRG improves spatial understanding in VLMs evaluated on the What'sUp dataset, and the compositional generalization of VLMs on SugarCrepe's swap-att and swap-obj splits.


Takeaway: CRG helps measure the alignment between text and generated images on SeeTRUE.

Evaluation on Referring Expression Comprehension and Phrase Grounding


Takeaway: CRG improves region-text alignment in VLMs for re-ranking bounding box proposals on RefCOCO/+/g and Flickr30K Entities.
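One way to implement this re-ranking, reusing the hypothetical score_text and black_out helpers sketched above: black out each detector proposal in turn, and keep the box whose removal lowers the likelihood of the referring expression the most, i.e., the proposal that yields the largest contrast.

def rerank_proposals(model, image, text, boxes):
    # The box whose removal causes the largest drop in the likelihood of the
    # referring expression is the one the text depends on most.
    full = score_text(model, image, text)
    return max(boxes, key=lambda box: full - score_text(model, black_out(image, [box]), text))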

Analysis


We investigate the impact of different region guidance strategies, including contrasting the original image with alternative images (e.g., images where different regions are blacked out) and overlaying visual markers (e.g., a bounding box or segmentation mask), on the What’sUp benchmark.
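For reference, the visual-marker variant simply draws the prompt on top of the image (e.g., a bounding box outline) rather than blacking the region out; a minimal sketch of such an overlay, with the color and line width chosen arbitrarily:

def overlay_box(image, boxes, color="red", width=4):
    # Draw bounding-box outlines as a visual prompt; unlike black_out above,
    # the image content inside the box stays visible.
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for x1, y1, x2, y2 in boxes:
        draw.rectangle([x1, y1, x2, y2], outline=color, width=width)
    return marked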

Takeaways:
1. Blacking out only the relevant regions is important.
2. Simply overlaying visual markers without CRG is ineffective for pre-trained VLMs.

Examples


Question: What is the value of the variable in the equation within the purple rectangle?
LLaVA-1.6-34B: The equation within the purple rectangle is: 10 + a = 12. To find the value of the variable 'a', you need to solve for 'a'. You can do this by subtracting 10 from both sides of the equation: a = 12 - 10. a = 2. So the value of 'a' in the equation within the purple rectangle is 2.
LLaVA-1.6-34B+CRG (Ours): The equation within the purple rectangle reads: 34 + y = 39. To find the value of the variable 'y', you subtract 34 from both sides of the equation: 34 + y = 39. y = 39 - 34. y = 5. So, the value of 'y' in this equation is 5.


Question: What is the color of the clothing of the person within the yellow rectangle?
ViP-LLaVA-13B: The person within the yellow rectangle is wearing white clothing.
LLaVA-1.6-34B: The person within the yellow rectangle is wearing a brown coat and a white shirt.
LLaVA-1.6-34B+CRG (Ours): The person within the yellow rectangle is wearing a long, gray dress with what appears to be a fitted bodice and a flowing skirt.


Question: Which plant is the tallest? Plant 1: the plant within the purple rectangle, Plant 2: the plant within the red rectangle, or Plant 3: the plant within the yellow rectangle?
ViP-LLaVA-13B: Plant 1, the plant within the purple rectangle, is the tallest.
LLaVA-1.6-34B: The tallest plant in the image is the one within the purple rectangle.
LLaVA-1.6-34B+CRG (Ours): Based on the image provided, Plant 2 within the red rectangle appears to be the tallest among the three highlighted plants.


Question: What is the color of the clothing of the person within the yellow rectangle?
ViP-LLaVA-13B: The person within the yellow rectangle is wearing a black suit.
LLaVA-1.6-34B: The person within the yellow rectangle is wearing a white shirt.
LLaVA-1.6-34B+CRG (Ours): The person within the yellow rectangle is wearing black clothing.

BibTeX

@article{Wan2024CRG,
  author    = {David Wan and Jaemin Cho and Elias Stengel-Eskin and Mohit Bansal},
  title     = {Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training},
  year      = {2024},
}