The 3rd Workshop on Computer Vision in the Wild
-- Building General-Purpose Assistants with Large Multimodal Models
@ CVPR 2024, June 17 || Location: Arch 3B

Overview

A long-standing aspiration in artificial intelligence is to develop general-purpose assistants that can effectively follow users’ (multimodal) instructions to complete a wide range of real-world tasks. Recently, the community has witnessed a growing interest in developing foundation models with emergent abilities for multimodal understanding and generation in open-world tasks. While recipes for using large language models (LLMs) such as ChatGPT to develop general-purpose assistants for natural language tasks have proven effective, recipes for building general-purpose, multimodal assistants for computer vision and vision-language tasks in the wild remain to be explored.

Recent work shows that learning from large-scale image-text data with human feedback in the loop is a promising approach to building transferable visual models that can effortlessly adapt to a wide range of downstream computer vision (CV) and multimodal (MM) tasks. For example, large multimodal models (LMMs) such as Flamingo, GPT-4V, and Gemini have demonstrated strong zero-shot transfer capabilities on many vision tasks in the wild. Open-source LMMs have also made significant progress, as demonstrated by OpenFlamingo, MiniGPT-4, and LLaVA. These models are trained with visual instruction-following data, where human intents are represented in natural language. On the other hand, interactive vision systems such as Segment Anything (SAM) and SEEM have shown impressive segmentation performance on almost anything in the wild, where human intents are represented as visual prompts such as clicks, bounding boxes, and text. These vision models with language and multimodal interfaces are naturally open-vocabulary, and even open-task, models, showing superior zero-shot performance in various real-world scenarios.

We host this “Computer Vision in the Wild (CVinW)” workshop to gather the academic and industry communities working on CV problems in real-world scenarios, focusing on the challenge of open-world visual task-level transfer. This CVPR 2024 CVinW workshop is a continuation of the CVPR 2023 CVinW Workshop and the ECCV 2022 CVinW Workshop. For those who are new to this topic, please check out the CVinW Reading List.

The development of LMMs is an emerging field with a vast research exploration space in data collection, modeling, evaluation, and new application scenarios. Many new evaluations and benchmarks have emerged to measure their performance from different aspects. To promote established benchmarks for measuring progress, this workshop welcomes authors of different benchmarks to run independent challenges and report results. We highlight a few recent LMM evaluation toolkits / benchmarks:


Schedule (June 17th, Monday)

9:15 AM - 9:30 AM PT
Welcome
Jianfeng Gao - Microsoft Research

Morning Session On-Site Chairs: Xueyan Zou (UW Madison), Yonatan Bitton (Google)

9:30 AM - 10:00 AM PT
Invited Talk

Xiaolong Wang - UCSD
| Bio
Xiaolong Wang is an Assistant Professor in the ECE department at the University of California, San Diego. He received his Ph.D. in Robotics at Carnegie Mellon University. His postdoctoral training was at the University of California, Berkeley. His research focuses on the intersection between computer vision and robotics. His specific interest lies in learning 3D and dynamics representations from videos and physical robotic interaction data. These comprehensive representations are utilized to facilitate the learning of human-like robot skills, with the goal of generalizing the robot to interact effectively with a wide range of objects and environments in the real physical world. He is the recipient of the NSF CAREER Award, Intel Rising Star Faculty Award, and Research Awards from Sony, Amazon, Adobe, and Cisco.


Spatial Perception and Control in the Wild
10:00 AM - 10:30 AM PT

Spotlight Paper Presentations

* BLINK: Multimodal Large Language Models Can See but Not Perceive
* Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering
* CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
* What’s in a Name? Beyond Class Indices for Image Recognition
* ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
* LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration - A Robot Sous-Chef Application

10:30 AM - 11:00 AM PT
Invited Talk

Jack Hessel - Samaya AI
| Bio
Jack Hessel is a researcher at Samaya AI, a knowledge discovery startup. Previously, he was a postdoc and research scientist at the Allen Institute for AI, and before that, he earned a PhD from Cornell. His work, which spans language processing, machine learning, and computer vision, has been recognized with several awards, including an ACL 2023 Best Paper Award and an EMNLP 2023 Outstanding Paper Award.


Visual Details Don’t Matter (Until They Do)
Abstract
Visual details are unconsciously filtered by human attention: most things simply aren’t salient most of the time. But, what happens when small details *become* salient (as they often do), e.g., for a task, for a conversation, etc.? While humans can allocate directed, conscious perception with high fidelity, in this talk, I’ll highlight recent work demonstrating surprisingly simple cases where large vision+language models fall short. Along the way, we’ll encounter a paradox, a curse, and potential paths forward.
11:00 AM - 11:30 AM PT
Invited Talk: Evaluation of LMMs

Ziwei Liu - Nanyang Technological University
| Bio
Prof. Ziwei Liu is currently an Assistant Professor at Nanyang Technological University, Singapore. His research revolves around computer vision, machine learning, and computer graphics. He has published extensively in top-tier conferences and journals in relevant fields, including CVPR, ICCV, ECCV, NeurIPS, ICLR, SIGGRAPH, TPAMI, TOG, and Nature Machine Intelligence. He is the recipient of the Microsoft Young Fellowship, Hong Kong PhD Fellowship, ICCV Young Researcher Award, HKSTP Best Paper Award, CVPR Best Paper Award Candidate, WAIC Yunfan Award, ICBS Frontiers of Science Award, and MIT Technology Review Innovators Under 35 Asia Pacific. He serves as an Area Chair of CVPR, ICCV, ECCV, NeurIPS, and ICLR, as well as an Associate Editor of IJCV.


LMMs-Eval: The Evaluation Suite of Large Multimodal Models
[Slides]
Abstract
The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMs-Eval, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMs-Eval offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMs-Eval Lite, a pruned evaluation set that emphasizes both coverage and efficiency. Additionally, we present LiveBench that utilizes continuously updating news and online forums to assess models' generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs.
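
To make the idea of a “unified and standardized” evaluation harness concrete, here is a minimal sketch of the kind of model/task interface such a framework standardizes. The class names, fields, and scoring rule below are hypothetical placeholders, not the actual LMMs-Eval API: every model exposes the same generation call, every task carries its own examples and metric, and the harness simply loops over the model-by-task grid.

# Minimal, illustrative sketch of a unified LMM evaluation harness.
# All names below (MultimodalModel, Task, evaluate, ...) are hypothetical
# placeholders for exposition; they are NOT the actual LMMs-Eval API.
from dataclasses import dataclass
from typing import Callable, Protocol


class MultimodalModel(Protocol):
    """Any model that can answer a text prompt about an image."""
    name: str

    def generate(self, image_path: str, prompt: str) -> str:
        ...


@dataclass
class Task:
    name: str
    examples: list[dict]                 # each: {"image", "prompt", "answer"}
    score: Callable[[str, str], float]   # (prediction, reference) -> value in [0, 1]


def exact_match(prediction: str, reference: str) -> float:
    """Simplest possible metric: case-insensitive exact match."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(models: list, tasks: list[Task]) -> dict:
    """Run every model on every task; return the mean score per (model, task)."""
    results = {}
    for model in models:
        for task in tasks:
            scores = [
                task.score(model.generate(ex["image"], ex["prompt"]), ex["answer"])
                for ex in task.examples
            ]
            results[(model.name, task.name)] = sum(scores) / max(len(scores), 1)
    return results

Standardizing this interface is what lets new models and new tasks be added independently and compared under identical prompts and metrics, which is the point of a transparent, reproducible evaluation suite.
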
11:30 AM - 12:00 PM PT
Invited Talk

Guanya Shi - Carnegie Mellon University
| Bio
Guanya Shi is an Assistant Professor at the Robotics Institute at Carnegie Mellon University (CMU). He completed his Ph.D. in Control and Dynamical Systems at Caltech in 2022. Before joining CMU, he was a postdoctoral scholar at the University of Washington. He is broadly interested in the intersection of machine learning and control theory, spanning the entire spectrum from theory and foundations, to algorithm design, to real-world agile robotics. Guanya is the recipient of several awards, including the Simoudis Discovery Prize and the Ben P.C. Chou Doctoral Prize from Caltech, and a Rising Star in Data Science award. He is an Associate Editor of IEEE Robotics and Automation Letters.


Unifying Semantic and Physical Intelligence for Generalist Humanoid Robots
Abstract
Humanoid robots offer two unparalleled advantages in general-purpose embodied intelligence. First, humanoids are built as generalist robots that can potentially do all the tasks humans can do in complex environments. Second, the embodiment alignment between humans and humanoids allows for the seamless integration of human cognitive skills with versatile humanoid capabilities. To build generalist humanoids, there are three critical aspects of intelligence: (1) Semantic intelligence (how the robot understands the world and reasons); (2) Physical/Motion intelligence (locomotion and manipulation skills); and (3) Mechanical/Hardware intelligence (how the robot actuates and senses). In this talk, I will present some recent works (H2O, OmniH2O, ABS) that aim to unify semantic and physical intelligence for humanoid robots. In particular, H2O and OmniH2O provide a universal and dexterous interface that enables diverse human control (e.g., VR, RGB) and autonomy (e.g., using imitation learning or VLMs) methods for humanoids, and ABS provides safety guarantees for agile vision-based locomotion control.

Afternoon Session On-Site Chair: Haotian Liu (UW Madison)

1:30 PM - 2:00 PM PT
Invited Talk

Zhe Gan - Apple
| Bio
Zhe Gan is a Research Scientist and Manager at Apple, where he focuses on developing large-scale vision and multimodal foundation models. Prior to his tenure at Apple, he was a principal researcher at Microsoft Azure AI. He earned his Ph.D. from Duke University in 2018. He has consistently served as an Area Chair at leading AI conferences, including NeurIPS, ICML, ICLR, CVPR, ECCV, ACL, NAACL, and EMNLP. He has also been honored with the Best Student Paper Honorable Mention Awards at CVPR 2021 and WACV 2021.


MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Abstract
In this talk, I will discuss building performant Multimodal Large Language Models (MLLMs). In particular, I will discuss the lessons we have learned in developing MM1, a family of multimodal models, including both dense variants up to 30B and mixture-of-experts (MoE) variants up to 64B, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. If time permits, I will also briefly mention our Ferret model family, including the most recent Ferret-UI model.
2:00 PM - 2:30 PM PT
Invited Talk

Ludwig Schmidt - Stanford/Anthropic

2:30 PM - 3:30 PM PT

Poster Session (Afternoon Break)

Poster boards: #1 - #15 (Location: Arch 4E)
3:30 PM - 4:00 PM PT
Invited Talk

Yong Jae Lee - University of Wisconsin-Madison
| Bio
Yong Jae Lee is an Associate Professor in the CS Dept at the University of Wisconsin-Madison. His research interests are in computer vision and machine learning, with a focus on creating robust visual recognition systems that can learn to understand the visual world with minimal human supervision. He is a recipient of several awards, including the Army Research Office Young Investigator Award, the NSF CAREER Award, the Most Innovative Award at the COCO Object Detection Challenge at ICCV 2019, and the Best Paper Award at BMVC 2020.


Building Steerable Generalist Multimodal Models [Slides]
Abstract
I'll present how to build steerable large multimodal (vision-language) models that can understand human instructions and solve a variety of visual understanding tasks. I'll start by talking about the LLaVA series, and then discuss ways to make LLaVA understand visual prompts. I'll also briefly discuss how to make it more personalized, and more adaptable to varying image complexities. I'll conclude with a discussion on limitations and next steps. Relevant project pages: https://llava-vl.github.io/, https://vip-llava.github.io/, https://thaoshibe.github.io/YoLLaVA/
4:00 PM - 4:30 PM PT
Invited Talk: Evaluation of LMMs

Yujie Lu - UC Santa Barbara
| Bio
Yujie Lu is a third-year PhD student at the University of California, Santa Barbara, working with Prof. William Wang. Her research interests include vision and language models, as well as evaluation and benchmarking. Yujie has interned at MSR, AWS AI, and Meta FAIR, and was selected as a Robert Noyce Fellow. Her work has been published in conferences such as ICCV, ICLR, NeurIPS, EMNLP, NAACL, and CoRL, and received the CHI 2023 Best Paper Award. She also co-organized SoCalNLP 2022.


WildVision Arena: Evaluating Vision-Language Models in the Wild with Human Preferences
Abstract
Recent breakthroughs in vision-language models (VLMs) emphasize the necessity of benchmarking human preferences in real-world multimodal interactions. To address this gap, we launched WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate VLMs. We curated WV-Bench by selecting 500 high-quality samples from 8,000 user submissions in WV-Arena. WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo. This significantly outperforms other benchmarks like MMVet, MMMU, and MMStar. Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs. For example, we find that although GPT-4V surpasses many other models like Reka-Flash, Opus, and Yi-VL-Plus in simple visual recognition and reasoning tasks, it still faces challenges with subtle contextual cues, spatial reasoning, visual imagination, and expert domain knowledge. Additionally, current VLMs exhibit issues with hallucinations and safety when intentionally provoked. We are releasing our chat and feedback data to further advance research in the field of VLMs.
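
For readers new to arena-style evaluation, the sketch below shows one common way pairwise votes are turned into a leaderboard and then compared against an automatic benchmark: standard Elo updates over (winner, loser) pairs, followed by a Spearman rank correlation (scipy.stats.spearmanr) between the Elo ratings and the benchmark scores. All votes, model names, benchmark numbers, and the K-factor are invented for illustration; this is a generic Elo sketch, not the WV-Arena implementation.

# Illustrative sketch: Elo ratings from pairwise human preference votes,
# plus a Spearman rank correlation against an automatic benchmark score.
# The votes, model names, and benchmark numbers are invented examples;
# this is not the WV-Arena implementation.
from collections import defaultdict

from scipy.stats import spearmanr

K = 32          # Elo update step size
BASE = 1000.0   # initial rating for every model


def expected_win(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a beats one rated r_b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(votes):
    """votes: iterable of (winner, loser) model-name pairs from human judgments."""
    ratings = defaultdict(lambda: BASE)
    for winner, loser in votes:
        e_w = expected_win(ratings[winner], ratings[loser])
        ratings[winner] += K * (1.0 - e_w)   # winner gains in proportion to how unexpected the win was
        ratings[loser] -= K * (1.0 - e_w)    # loser pays the same amount
    return dict(ratings)


if __name__ == "__main__":
    votes = [("model_a", "model_b"), ("model_a", "model_c"),
             ("model_b", "model_c"), ("model_a", "model_b")]
    ratings = elo_ratings(votes)

    # Hypothetical automatic benchmark scores for the same three models.
    bench = {"model_a": 71.2, "model_b": 63.5, "model_c": 58.0}
    models = sorted(ratings)
    rho, _ = spearmanr([ratings[m] for m in models], [bench[m] for m in models])
    print(ratings)
    print(f"Spearman correlation with the benchmark: {rho:.2f}")
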

Accepted Papers

Call for Papers

Topics of interest include but are not limited to:

  • LMMs: collection/curation/creation of pre-training and instruction-following data, alignment with human intents, and modeling
  • New metrics / benchmarks / datasets to evaluate LMMs, task-level transfer, and open-set visual recognition
  • Unified neural network architectures and training objectives over different CV & MM tasks
  • Tool use, external knowledge, and multimodal agents
  • Open-set visual recognition methods, including classification, object detection, and segmentation in images and videos
  • Efficient large visual model adaptation methods, measured by #training samples (zero-shot and few-shot), #trainable parameters, throughput, and training cost
  • Efficient prompting, inference, and serving techniques at scale

We accept abstract submissions to our workshop. All submissions should be at most 8 pages (excluding references), following the CVPR 2024 author guidelines. All submissions will be reviewed by the Program Committee on the basis of technical quality, relevance to the scope of the workshop, originality, significance, and clarity. The review process is double-blind, and accepted papers are NOT archived in the CVPR proceedings.

    Workshop Paper Submission Portal: [CMT]


Dates

Feb 2024 Competition starts, testing phase begins
June 2nd, 2024 Competition ends (challenge paper submission)
April 28th, 2024 Workshop paper submission deadline
May 26th, 2024 Workshop paper acceptance decision to authors
June 2nd, 2024 Camera-ready submission deadline


Workshop Organizers



Chunyuan Li - ByteDance / TikTok
Jianwei Yang - Microsoft
Haotian Liu - UW Madison
Xueyan Zou - UW Madison
Wanrong Zhu - UCSB
Yonatan Bitton - Google
Jianfeng Gao - Microsoft


Workshop and Challenge Questions?
Reach out: https://github.com/Computer-Vision-in-the-Wild/cvpr-2024
Workshop Organizing Team