The 3rd Workshop on Computer Vision in the Wild
-- Building General-Purpose Assistants with Large Multimodal Models
@ CVPR 2024, June 17


A long-standing aspiration in artificial intelligence is to develop general-purpose assistants that can effectively follow users’ (multimodal) instructions to complete a wide range of real-world tasks. Recently, the community has witnessed growing interest in developing foundation models with emergent abilities of multimodal understanding and generation in open-world tasks. While the recipes for using large language models (LLMs) such as ChatGPT to develop general-purpose assistants for natural language tasks have proved effective, the recipes for building general-purpose, multimodal assistants for computer vision and vision-language tasks in the wild remain to be explored. Recent work shows that learning from large-scale image-text data with human feedback in the loop is a promising approach to building transferable visual models that can effortlessly adapt to a wide range of downstream computer vision (CV) and multimodal (MM) tasks. For example, large multimodal models (LMMs) such as Flamingo, GPT-4V, and Gemini have demonstrated strong zero-shot transfer capabilities on many vision tasks in the wild. Open-source LMMs have also made significant progress, as demonstrated by OpenFlamingo, MiniGPT-4, and LLaVA. These models are trained with visual instruction-following data, where human intents are represented in natural language. On the other hand, interactive vision systems such as Segment Anything (SAM) and SEEM have shown impressive segmentation performance on almost anything in the wild, where human intents are represented as visual prompts, such as clicks, bounding boxes, and text. These vision models with language and multimodal interfaces are naturally open-vocabulary and even open-task models, showing superior zero-shot performance in various real-world scenarios.
We host this “Computer Vision in the Wild (CVinW)” workshop to gather the academic and industry communities to work on CV problems in real-world scenarios, focusing on the challenge of open-world visual task-level transfer. The CVPR 2024 CVinW workshop is a continuation of the CVPR 2023 CVinW Workshop and the ECCV 2022 CVinW Workshop. For those who are new to this topic, please check out the CVinW Reading List.

The development of LMMs is an emerging field, with a vast research exploration space in data collection, modeling, evaluation, and new application scenarios. Many new benchmarks have emerged to measure their performance from different aspects. To advocate established benchmarks for measuring progress, this workshop welcomes the authors of different benchmarks to run independent challenges and report results. Initially, it will also host two challenges.

Call for Papers

Topics of interest include but are not limited to:

  • LMMs: collection/curation/creation of pre-training and instruction-following data, alignment with human intents, and modeling
  • New metrics / benchmarks / datasets to evaluate LMMs, task-level transfer, and open-set visual recognition
  • Unified neural network architectures and training objectives across different CV & MM tasks
  • Tool use, external knowledge, and multimodal agents
  • Open-set visual recognition methods, including classification, object detection, and segmentation in images and videos
  • Efficient large visual model adaptation methods, measured by #training samples (zero-shot and few-shot), #trainable parameters, throughput, and training cost
  • Efficient prompting, inference, and serving techniques at scale

  • We accept abstract submissions to our workshop. All submissions must be at most 8 pages (excluding references), following the CVPR 2024 author guidelines. All submissions will be reviewed by the Program Committee on the basis of technical quality, relevance to the scope of the workshop, originality, significance, and clarity. The review process is double-blind, and accepted papers are NOT archived in the CVPR proceedings.

    Workshop Paper Submission Portal: [CMT]


Feb 2024 Competition starts, testing phase begins
June 2nd, 2024 Competition ends (challenge paper submission)
April 28th, 2024 Workshop paper submission deadline
May 19th, 2024 Workshop paper acceptance decision to authors
June 2nd, 2024 Camera-ready submission deadline

Workshop Organizers

Chunyuan Li
ByteDance / TikTok

Jianwei Yang

Haotian Liu
UW Madison

Xueyan Zou
UW Madison

Wanrong Zhu

Yonatan Bitton

Jianfeng Gao

Workshop and Challenge Questions?
Reach out:
Workshop Organizing Team