The 3rd Workshop on Computer Vision in the Wild
-- Building General-Purpose Assistants with Large Multimodal Models
@ CVPR 2024, June 17 || Location: Arch 3B

Overview

A long-standing aspiration in artificial intelligence is to develop general-purpose assistants that can effectively follow users’ (multimodal) instructions to complete a wide range of real-world tasks. Recently, the community has witnessed a growing interest in developing foundation models with emergent abilities for multimodal understanding and generation in open-world tasks. While recipes for using large language models (LLMs) such as ChatGPT to develop general-purpose assistants for natural language tasks have proven effective, recipes for building general-purpose, multimodal assistants for computer vision and vision-language tasks in the wild remain to be explored.

Recent work shows that learning from large-scale image-text data with human feedback in the loop is a promising approach to building transferable visual models that can effortlessly adapt to a wide range of downstream computer vision (CV) and multimodal (MM) tasks. For example, large multimodal models (LMMs) such as Flamingo, GPT-4V, and Gemini have demonstrated strong zero-shot transfer capabilities on many vision tasks in the wild. Open-source LMMs have also made significant progress, as demonstrated by OpenFlamingo, MiniGPT-4, and LLaVA. These models are trained with visual instruction-following data, where human intents are represented in natural language. On the other hand, interactive vision systems such as Segment Anything (SAM) and SEEM have shown impressive segmentation performance on almost anything in the wild, where human intents are represented as visual prompts such as clicks, bounding boxes, and text. These vision models with language and multimodal interfaces are naturally open-vocabulary, and even open-task, models, showing superior zero-shot performance in various real-world scenarios.

We host this “Computer Vision in the Wild (CVinW)” workshop to gather the academic and industry communities working on CV problems in real-world scenarios, focusing on the challenge of open-world visual task-level transfer. This CVPR 2024 CVinW workshop is a continuation of the CVPR 2023 CVinW Workshop and the ECCV 2022 CVinW Workshop. For those who are new to this topic, please check out the CVinW Reading List.

The development of LMMs is an emerging field with a vast research exploration space in data collection, modeling, evaluation, and new application scenarios. Many new evaluations and benchmarks have emerged to measure their performance from different aspects. To promote established benchmarks for measuring progress, this workshop welcomes authors of different benchmarks to run independent challenges and report results. We highlight a few recent LMM evaluation toolkits / benchmarks:


Schedule (June 17th, Monday)

9:15 AM - 9:30 AM PT
Welcome
Jianfeng Gao - Microsoft Research

Morning Session On-Site Chairs: Xueyan Zou (UW Madison), Yonatan Bitton (Google)

9:30 AM - 10:00 AM PT
Invited Talk

Xiaolong Wang - UCSD
| Bio
Xiaolong Wang is an Assistant Professor in the ECE department at the University of California, San Diego. He received his Ph.D. in Robotics at Carnegie Mellon University. His postdoctoral training was at the University of California, Berkeley. His research focuses on the intersection between computer vision and robotics. His specific interest lies in learning 3D and dynamics representations from videos and physical robotic interaction data. These comprehensive representations are utilized to facilitate the learning of human-like robot skills, with the goal of generalizing the robot to interact effectively with a wide range of objects and environments in the real physical world. He is the recipient of the NSF CAREER Award, Intel Rising Star Faculty Award, and Research Awards from Sony, Amazon, Adobe, and Cisco.


Spatial Perception and Control in the Wild
10:00 AM - 10:30 AM PT

Spotlight Paper Presentations

* BLINK: Multimodal Large Language Models Can See but Not Perceive
* Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering
* CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
* What’s in a Name? Beyond Class Indices for Image Recognition
* ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
* LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration - A Robot Sous-Chef Application

10:30 AM - 11:00 AM PT
Invited Talk

Jack Hessel - Samaya AI
| Bio
Jack Hessel is a researcher at Samaya AI, a knowledge discovery startup. Previously, he was a postdoc and research scientist at the Allen Institute for AI, and before that, he earned a PhD from Cornell. His work, which spans language processing, machine learning, and computer vision, has been recognized with several awards, including an ACL 2023 Best Paper Award and an EMNLP 2023 Outstanding Paper Award.


Visual Details Don’t Matter (Until They Do)
Abstract
Visual details are unconsciously filtered by human attention: most things simply aren’t salient most of the time. But, what happens when small details *become* salient (as they often do), e.g., for a task, for a conversation, etc.? While humans can allocate directed, conscious perception with high fidelity, in this talk, I’ll highlight recent work demonstrating surprisingly simple cases where large vision+language models fall short. Along the way, we’ll encounter a paradox, a curse, and potential paths forward.
11:00 AM - 11:30 AM PT
Invited Talk: Evaluation of LMMs

Ziwei Liu - Nanyang Technological University
| Bio
Prof. Ziwei Liu is currently an Assistant Professor at Nanyang Technological University, Singapore. His research revolves around computer vision, machine learning, and computer graphics. He has published extensively in top-tier conferences and journals in relevant fields, including CVPR, ICCV, ECCV, NeurIPS, ICLR, SIGGRAPH, TPAMI, TOG, and Nature Machine Intelligence. He is the recipient of the Microsoft Young Fellowship, Hong Kong PhD Fellowship, ICCV Young Researcher Award, HKSTP Best Paper Award, CVPR Best Paper Award Candidate, WAIC Yunfan Award, ICBS Frontiers of Science Award, and MIT Technology Review Innovators Under 35 Asia Pacific. He serves as an Area Chair of CVPR, ICCV, ECCV, NeurIPS, and ICLR, as well as an Associate Editor of IJCV.


LMMs-Eval: The Evaluation Suite of Large Multimodal Models
[Slides]
Abstract
The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMs-Eval, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMs-Eval offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMs-Eval Lite, a pruned evaluation set that emphasizes both coverage and efficiency. Additionally, we present LiveBench that utilizes continuously updating news and online forums to assess models' generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs.
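
To make the idea of a “unified and standardized” evaluation harness concrete, here is a minimal sketch of the kind of model/task interface such a framework standardizes. The class names, fields, and scoring rule below are hypothetical placeholders, not the actual LMMs-Eval API: every model exposes the same generation call, every task carries its own examples and metric, and the harness simply loops over the model-by-task grid.

# Minimal, illustrative sketch of a unified LMM evaluation harness.
# All names below (MultimodalModel, Task, evaluate, ...) are hypothetical
# placeholders for exposition; they are NOT the actual LMMs-Eval API.
from dataclasses import dataclass
from typing import Callable, Protocol


class MultimodalModel(Protocol):
    """Any model that can answer a text prompt about an image."""
    name: str

    def generate(self, image_path: str, prompt: str) -> str:
        ...


@dataclass
class Task:
    name: str
    examples: list[dict]                 # each: {"image", "prompt", "answer"}
    score: Callable[[str, str], float]   # (prediction, reference) -> value in [0, 1]


def exact_match(prediction: str, reference: str) -> float:
    """Simplest possible metric: case-insensitive exact match."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(models: list, tasks: list[Task]) -> dict:
    """Run every model on every task; return the mean score per (model, task)."""
    results = {}
    for model in models:
        for task in tasks:
            scores = [
                task.score(model.generate(ex["image"], ex["prompt"]), ex["answer"])
                for ex in task.examples
            ]
            results[(model.name, task.name)] = sum(scores) / max(len(scores), 1)
    return results

Standardizing this interface is what lets new models and new tasks be added independently and compared under identical prompts and metrics, which is the point of a transparent, reproducible evaluation suite.
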
11:30 AM - 12:00 PM PT
Invited Talk

Guanya Shi - Carnegie Mellon University
| Bio
Guanya Shi is an Assistant Professor at the Robotics Institute at Carnegie Mellon University (CMU). He completed his Ph.D. in Control and Dynamical Systems at Caltech in 2022. Before joining CMU, he was a postdoctoral scholar at the University of Washington. He is broadly interested in the intersection of machine learning and control theory, spanning the entire spectrum from theory and foundations, to algorithm design, to real-world agile robotics. Guanya is the recipient of several awards, including the Simoudis Discovery Prize and the Ben P.C. Chou Doctoral Prize from Caltech, and a Rising Star in Data Science award. He is an Associate Editor of IEEE Robotics and Automation Letters.


Unifying Semantic and Physical Intelligence for Generalist Humanoid Robots
Abstract
Humanoid robots offer two unparalleled advantages in general-purpose embodied intelligence. First, humanoids are built as generalist robots that can potentially do all the tasks humans can do in complex environments. Second, the embodiment alignment between humans and humanoids allows for the seamless integration of human cognitive skills with versatile humanoid capabilities. To build generalist humanoids, there are three critical aspects of intelligence: (1) Semantic intelligence (how the robot understands the world and reasons); (2) Physical/Motion intelligence (locomotion and manipulation skills); and (3) Mechanical/Hardware intelligence (how the robot actuates and senses). In this talk, I will present some recent works (H2O, OmniH2O, ABS) that aim to unify semantic and physical intelligence for humanoid robots. In particular, H2O and OmniH2O provide a universal and dexterous interface that enables diverse human control (e.g., VR, RGB) and autonomy (e.g., using imitation learning or VLMs) methods for humanoids, and ABS provides safety guarantees for agile vision-based locomotion control.

Afternoon Session On-Site Chair: Haotian Liu (UW Madison)

1:30 PM - 2:00 PM PT
Invited Talk

Zhe Gan - Apple
| Bio
Zhe Gan is a Research Scientist and Manager at Apple, where he focuses on developing large-scale vision and multimodal foundation models. Prior to his tenure at Apple, he was a principal researcher at Microsoft Azure AI. He earned his Ph.D. from Duke University in 2018. He has consistently served as an Area Chair at leading AI conferences, including NeurIPS, ICML, ICLR, CVPR, ECCV, ACL, NAACL, and EMNLP. He has also been honored with the Best Student Paper Honorable Mention Awards at CVPR 2021 and WACV 2021.


MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Abstract
In this talk, I will discuss building performant Multimodal Large Language Models (MLLMs). In particular, I will discuss the lessons we have learned in developing MM1, a family of multimodal models, including both dense variants up to 30B and mixture-of-experts (MoE) variants up to 64B, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. If time permits, I will also briefly mention our Ferret model family, including the most recent Ferret-UI model.
2:00 PM - 2:30 PM PT
Invited Talk

Ludwig Schmidt - Stanford/Anthropic

2:30 PM - 3:30 PM PT

Poster Session (Afternoon Break)

Poster boards: #1 - #15 (Location: Arch 4E)
3:30 PM - 4:00 PM PT
Invited Talk

Yong Jae Lee - University of Wisconsin-Madison
| Bio
Yong Jae Lee is an Associate Professor in the CS Dept at the University of Wisconsin-Madison. His research interests are in computer vision and machine learning, with a focus on creating robust visual recognition systems that can learn to understand the visual world with minimal human supervision. He is a recipient of several awards, including the Army Research Office Young Investigator Award, the NSF CAREER Award, the Most Innovative Award at the COCO Object Detection Challenge at ICCV 2019, and the Best Paper Award at BMVC 2020.


Building Steerable Generalist Multimodal Models [Slides]
Abstract
I'll present how to build steerable large multimodal (vision-language) models that can understand human instructions and solve a variety of visual understanding tasks. I'll start by talking about the LLaVA series, and then discuss ways to make LLaVA understand visual prompts. I'll also briefly discuss how to make it more personalized, and more adaptable to varying image complexities. I'll conclude with a discussion on limitations and next steps. Relevant project pages: https://llava-vl.github.io/, https://vip-llava.github.io/, https://thaoshibe.github.io/YoLLaVA/
4:00 PM - 4:30 PM PT
Invited Talk: Evaluation of LMMs

Yujie Lu - UC Santa Barbara
| Bio
Yujie Lu is a third-year PhD student at the University of California, Santa Barbara, working with Prof. William Wang. Her research interests include vision and language models, as well as evaluation and benchmarking. Yujie has interned at MSR, AWS AI, and Meta FAIR, and was selected as a Robert Noyce Fellow. Her work has been published in conferences such as ICCV, ICLR, NeurIPS, EMNLP, NAACL, and CoRL, and received the CHI 2023 Best Paper Award. She also co-organized SoCalNLP 2022.


WildVision Arena: Evaluating Vision-Language Models in the Wild with Human Preferences
Abstract
Recent breakthroughs in vision-language models (VLMs) emphasize the necessity of benchmarking human preferences in real-world multimodal interactions. To address this gap, we launched WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate VLMs. We curated WV-Bench by selecting 500 high-quality samples from 8,000 user submissions in WV-Arena. WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo. This significantly outperforms other benchmarks like MMVet, MMMU, and MMStar. Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs. For example, we find that although GPT-4V surpasses many other models like Reka-Flash, Opus, and Yi-VL-Plus in simple visual recognition and reasoning tasks, it still faces challenges with subtle contextual cues, spatial reasoning, visual imagination, and expert domain knowledge. Additionally, current VLMs exhibit issues with hallucinations and safety when intentionally provoked. We are releasing our chat and feedback data to further advance research in the field of VLMs.
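
For readers new to arena-style evaluation, the sketch below shows one common way pairwise votes are turned into a leaderboard and then compared against an automatic benchmark: standard Elo updates over (winner, loser) pairs, followed by a Spearman rank correlation (scipy.stats.spearmanr) between the Elo ratings and the benchmark scores. All votes, model names, benchmark numbers, and the K-factor are invented for illustration; this is a generic Elo sketch, not the WV-Arena implementation.

# Illustrative sketch: Elo ratings from pairwise human preference votes,
# plus a Spearman rank correlation against an automatic benchmark score.
# The votes, model names, and benchmark numbers are invented examples;
# this is not the WV-Arena implementation.
from collections import defaultdict

from scipy.stats import spearmanr

K = 32          # Elo update step size
BASE = 1000.0   # initial rating for every model


def expected_win(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a beats one rated r_b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(votes):
    """votes: iterable of (winner, loser) model-name pairs from human judgments."""
    ratings = defaultdict(lambda: BASE)
    for winner, loser in votes:
        e_w = expected_win(ratings[winner], ratings[loser])
        ratings[winner] += K * (1.0 - e_w)   # winner gains in proportion to how unexpected the win was
        ratings[loser] -= K * (1.0 - e_w)    # loser pays the same amount
    return dict(ratings)


if __name__ == "__main__":
    votes = [("model_a", "model_b"), ("model_a", "model_c"),
             ("model_b", "model_c"), ("model_a", "model_b")]
    ratings = elo_ratings(votes)

    # Hypothetical automatic benchmark scores for the same three models.
    bench = {"model_a": 71.2, "model_b": 63.5, "model_c": 58.0}
    models = sorted(ratings)
    rho, _ = spearmanr([ratings[m] for m in models], [bench[m] for m in models])
    print(ratings)
    print(f"Spearman correlation with the benchmark: {rho:.2f}")
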

Accepted Papers

Call for Papers

Topics of interest include but are not limited to:

  • LMMs: collection/curation/creation of pre-training and instruction-following data, alignment with human intents, and modeling
  • New metrics / benchmarks / datasets to evaluate LMMs, task-level transfer, and open-set visual recognition
  • Unified neural network architectures and training objectives over different CV & MM tasks
  • Tool use, external knowledge, and multimodal agents
  • Open-set visual recognition methods, including classification, object detection, and segmentation in images and videos
  • Efficient large visual model adaptation methods, measured by #training samples (zero-shot and few-shot), #trainable parameters, throughput, and training cost
  • Efficient prompting, inference, and serving techniques at scale

We accept abstract submissions to our workshop. All submissions should be at most 8 pages (excluding references), following the CVPR 2024 author guidelines. All submissions will be reviewed by the Program Committee on the basis of technical quality, relevance to the scope of the workshop, originality, significance, and clarity. The review process is double-blind, and accepted papers are NOT archived in the CVPR proceedings.

    Workshop Paper Submission Portal: [CMT]


Dates

Feb 2024 Competition starts, testing phase begins
June 2nd, 2024 Competition ends (challenge paper submission)
April 28th, 2024 Workshop paper submission deadline
May 26th, 2024 Workshop paper acceptance decision to authors
June 2nd, 2024 Camera-ready submission deadline


Workshop Organizers



Chunyuan Li - ByteDance / TikTok
Jianwei Yang - Microsoft
Haotian Liu - UW Madison
Xueyan Zou - UW Madison
Wanrong Zhu - UCSB
Yonatan Bitton - Google
Jianfeng Gao - Microsoft


Workshop and Challenge Questions?
Reach out: https://github.com/Computer-Vision-in-the-Wild/cvpr-2024
Workshop Organizing Team