Evaluation of Language-augmented Visual Task-level Transfer.


Various Datasets over Representative Tasks

20 image classification datasets / 35 object detection datasets.


Automatic hyper-parameter tuning; strong language-augmented efficient adaptation methods.

Diverse Knowledge Source

Each dataset concept is augmented with diverse knowledge sources, including WordNet, Wiktionary, and GPT-3.


To track the research advances in language-image models.


The ELEVATER benchmark is a collection of resources for training, evaluating, and analyzing language-image models on image classification and object detection. ELEVATER consists of:

  • Benchmark: A benchmark suite that consists of 20 image classification datasets and 35 object detection datasets, augmented with external knowledge.
  • Toolkit: An automatic hyper-parameter tuning toolkit and strong language-augmented efficient model adaptation methods.
  • Baseline: Pre-trained language-free and language-augmented visual models.
  • Knowledge: A platform to study the benefit of external knowledge for vision problems.
  • Evaluation Metrics: Sample-efficiency (zero-, few-, and full-shot) and Parameter-efficiency.
  • Leaderboard: A public leaderboard to track performance on the benchmark.
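
The sample-efficiency settings listed above (zero-, few-, and full-shot) differ in how many labeled training examples per class are available. A minimal sketch of drawing such subsets follows; the function name `sample_k_shot`, the seed handling, and the per-class sampling scheme are assumptions made for illustration, not the procedure the ELEVATER toolkit necessarily uses.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence


def sample_k_shot(
    labels: Sequence[Hashable], k: Optional[int], seed: int = 0
) -> List[int]:
    """Return sorted example indices with at most k examples per class.

    k = 0 gives the zero-shot setting (no training examples),
    k = None gives the full-shot setting (every example).
    """
    if k is None:
        return list(range(len(labels)))

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # A fixed seed keeps the few-shot split reproducible across runs.
    rng = random.Random(seed)
    chosen: List[int] = []
    for label in sorted(by_class, key=str):
        indices = by_class[label]
        chosen.extend(rng.sample(indices, min(k, len(indices))))
    return sorted(chosen)


if __name__ == "__main__":
    toy_labels = ["cat", "cat", "cat", "dog", "dog"]
    print(sample_k_shot(toy_labels, k=2))
```

Fixing the seed matters in the few-shot regime, where results can vary noticeably depending on which examples are drawn.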

The ultimate goal of ELEVATER is to drive research in the development of language-image models to tackle core computer vision problems in the wild.

[Quick introduction with slides]

A more diverse set of CV tasks


Please cite our paper as below if you use the ELEVATER benchmark or our toolkit.

    title={ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models},
    author={Li, Chunyuan and Liu, Haotian and Li, Liunian Harold and Zhang, Pengchuan and Aneja, Jyoti and Yang, Jianwei and Jin, Ping and Lee, Yong Jae and Hu, Houdong and Liu, Zicheng and Gao, Jianfeng},
    journal={arXiv preprint arXiv:2204.08790},


Have any questions or suggestions? Feel free to reach us by opening a GitHub issue!