ConvBench

ArXiv Paper | Project | Dataset

This repository is the official implementation of ConvBench.

ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Ablation Capability for Large Vision-Language Models
Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao^#, Kaipeng Zhang^#
^# WS (shaowenqi@pjlab.org.cn) and KZ (zhangkaipeng@pjlab.org.cn) are corresponding authors.

News🚀🚀🚀

2024/06/13: 🚀 We release ConvBench, the first benchmark designed to systematically evaluate the multi-turn visual conversation capability of existing LVLMs, together with a novel hierarchical ablation evaluation. Experimental results show that the performance of existing LVLMs is limited.

Introduction

Multi-turn visual conversation is an important ability of real-world AI assistants, yet a dedicated evaluation benchmark is missing. This paper presents ConvBench, a multi-turn conversation benchmark with hierarchical capability ablation evaluation for Large Vision-Language Models (LVLMs). ConvBench comprises 577 curated multi-turn conversations covering 215 tasks. These tasks are broad and open-ended, resembling real-world user behavior. ConvBench progressively examines an LVLM's perception, reasoning, and creativity capabilities in each conversation, and it can decouple these capabilities during evaluation to support reliable error attribution. In addition, given the diversity of open-ended questions, we introduce an efficient and reliable automatic evaluation framework. Experimental results reveal that ConvBench is a significant challenge for current LVLMs, even for GPT4V, which achieves a score of only 39.51. We also report several insightful findings, for example that the weak perception of LVLMs inhibits their authentic strengths in reasoning and creation. We believe our design of hierarchical capabilities, decoupled capability evaluation, and multi-turn conversation can blaze a new trail in LVLM evaluation.
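To make the idea of decoupling capabilities more concrete, here is a minimal Python sketch. It is not the repository's evaluation code: the `run_conversation` function, the `Model` interface, and the idea of substituting reference answers into the dialogue history for earlier turns are all assumptions made for illustration, and images are omitted for brevity. The sketch only shows how a later turn could be answered with either the model's own earlier answers or reference answers in context, so that errors in one capability can be separated from errors in the next.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Turn order follows the perception -> reasoning -> creation hierarchy.
TURNS = ("perception", "reasoning", "creation")

History = List[Tuple[str, str]]          # (question, answer) pairs so far
Model = Callable[[History, str], str]    # hypothetical model interface


def run_conversation(
    model: Model,
    questions: Dict[str, str],
    references: Optional[Dict[str, str]] = None,
    n_reference_turns: int = 0,
) -> Dict[str, str]:
    """Run one three-turn conversation and return the model's own answers.

    For the first `n_reference_turns` turns, the reference answer is placed
    in the history instead of querying the model (an assumed ablation scheme).
    """
    if n_reference_turns > 0 and references is None:
        raise ValueError("references are required when n_reference_turns > 0")

    history: History = []
    answers: Dict[str, str] = {}
    for index, turn in enumerate(TURNS):
        question = questions[turn]
        if index < n_reference_turns:
            # Ablated turn: use the reference so later turns see clean context.
            answer = references[turn]
        else:
            answer = model(history, question)
            answers[turn] = answer
        history.append((question, answer))
    return answers


def toy_model(history: History, question: str) -> str:
    """Stand-in model that echoes the question and the previous answer."""
    previous = history[-1][1] if history else "none"
    return f"{question}|prev={previous}"


questions = {"perception": "q1", "reasoning": "q2", "creation": "q3"}
references = {"perception": "r1", "reasoning": "r2", "creation": "r3"}

full_run = run_conversation(toy_model, questions)
perception_ablated = run_conversation(toy_model, questions, references, 1)
reasoning_ablated = run_conversation(toy_model, questions, references, 2)
```

Comparing the scores of `full_run`, `perception_ablated`, and `reasoning_ablated` on the same turn would then indicate how much a weakness in an earlier capability affects the later ones.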

Main Findings

Based on our benchmark, we conducted a series of experiments. The main findings are summarized as follows:

The most advanced LVLMs (e.g., GPT4V) still struggle to solve the challenges posed by ConvBench.
The novel hierarchical ablation evaluations of ConvBench indicate that the weakness of current LVLMs in “OCR”, “Fine-grained”, and “Spatial” perception may inhibit their performance on the subsequent reasoning and creation tasks.
The weakness of LVLMs in reasoning that demands “Professional Knowledge”, “Emotional Intelligence”, “Imagination”, and “Sense of Space” may hinder their performance on the subsequent creation tasks.
The performances across different tasks of different LVLMs show a similar distribution, which suggests the development of current LVLMs is synchronous.
Performance improves as the size of the LVLM's language model increases.
A decline in performance between the first turn and subsequent turns shows that LVLMs tend to develop comprehension biases as the multi-turn conversation progresses, or to forget information from previous turns.
High-quality dialogue history provides important guidance for the LVLMs’ responses and plays an important role as in-context learning examples.

Experimental Results

The performances across different tasks of different LVLMs show a similar distribution, which suggests the development of current LVLMs is synchronous.

Acknowledgement

ConvBench is built upon the documents from VisIT-Bench, which is a robust benchmark for diverse real-life vision-language instructions. VLMEvalKit provides useful out-of-the-box tools and implements many advanced LVLMs. Thanks for their selfless dedication.

License

The new contributions of our dataset (e.g., the instructions, reference outputs, model ranking annotations, etc.) are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). The images used are the same as those from VisIT-Bench; for these, please refer to the public license attached to each individual image in the “public_images_metadata” field in the dataset sheets in VisIT-Bench.

Citation

Please cite the following paper if you find this repo useful for your research:

@article{Liu2024ConvBenchAM,
  title={ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models},
  author={Shuo Liu and Kaining Ying and Hao Zhang and Yue Yang and Yuqi Lin and Tianle Zhang and Chuanhao Li and Yu Qiao and Ping Luo and Wenqi Shao and Kaipeng Zhang},
  journal={ArXiv},
  year={2024},
  volume={abs/2403.20194},
  url={https://api.semanticscholar.org/CorpusID:268793453}
}