The 5th Workshop on Computer Vision in the Wild
Theme: Building Multimodal AI Agents with Verbal, Spatial and Temporal Intelligence
Date: June 3-4, 2026 | Location: Denver Convention Center, Denver, CO

Overview

As artificial intelligence continues to evolve, the intersection of vision, language, and action is becoming central to systems that must operate reliably in unconstrained, real-world settings. The 5th Workshop on Computer Vision in the Wild (CVinW) at CVPR 2026 aims to bring together researchers and practitioners advancing multimodal AI agents that can perceive, reason, and act across digital and physical environments, while highlighting the capabilities where today's models still fall short. Building on the success of our previous workshops (the CVPR 2025, CVPR 2024, and CVPR 2023 CVinW Workshops and the ECCV 2022 CVinW Workshop), this year's edition focuses on the intersection of large multimodal models (LMMs) and vision-language-action (VLA) systems, with particular emphasis on fine-grained spatiotemporal reasoning, causal inference, long-horizon planning and memory, and robust tool use. Overall, the workshop emphasizes moving beyond static understanding toward agents that perceive, reason, and act in dynamic environments, spanning interactive digital settings and embodied physical interaction.




Image source: "Vision-Language Pre-training: Basics, Recent Advances, and Future Trends" and "Multimodal Foundation Models: From Specialists to General-Purpose Assistants"


Over the past few years, we have witnessed remarkable advances in open-vocabulary visual comprehension and multimodal learning. Five years ago, vision-language (multimodal) models were mostly built on top of the BERT architecture. These models typically contained fewer than 1B parameters and were trained on a relatively small number of images; representative works include ViLBERT, UNITER, and VisualBERT. They were mostly used for image-text tasks such as visual question answering (VQA) and image captioning. Later, multimodal vision foundation models such as CLIP, ALIGN, and Florence emerged, scaling multimodal training to billions of images. Although their model sizes were still relatively small, they demonstrated strong open-vocabulary and zero-shot recognition capabilities across a wide range of visual domains. These capabilities have since been transferred to fine-grained core vision tasks such as object detection (e.g., M-DETR, ViLD, GLIP, RegionCLIP, GroundingDINO, OWL-ViT) and image segmentation (e.g., X-Decoder, SegGPT, SEEM, SAM, LISA). Most recently, we have entered the era of large multimodal models: connecting multimodal vision models such as CLIP with large language models has given rise to systems such as Flamingo, Gemini, and GPT-4V, unlocking many advanced multimodal capabilities. Now we have multimodal chatbots such as GPT-4o, LLaVA-OneVision, Qwen-2.5-VL, and Phi-4-Multimodal that can see, talk, and reason.

Despite these successes, current vision models still lack the ability to fully grasp temporal dynamics, causal reasoning, and embodied interaction, which are key ingredients for autonomous agents that can see, reason, and act. Recent works have begun to address these challenges by building agentic models and VLA models. Our workshop aims to bring together leading researchers, practitioners, and industry experts to discuss these emerging trends, challenges, and solutions in the field.

Highlights

    (1) Invited Talks from leading experts in academia and industry on the latest advancements in multimodal AI.

    (2) Paper Presentations showcasing cutting-edge research contributions in computer vision in the wild.

    (3) Panel Discussions (Tentative) exploring the future of vision-language-action models and their impact on robotics, autonomous systems, and real-world AI applications.

We invite researchers, engineers, and enthusiasts to join us in shaping the future of vision systems that go beyond static image recognition to dynamic, interactive, and real-world AI applications. Stay tuned for more details on speakers, paper submissions, and challenge participation!

For more information, visit our official workshop page and explore our CVinW reading list: 📌 CVinW Readings




Invited Speakers




Manling Li
Northwestern University



Chelsea Finn
Stanford University



Xiaolong Wang
UC San Diego



Mohit Bansal
UNC Chapel Hill



Kate Saenko
Boston University




Tentative Schedule (June 3-4, Wednesday - Thursday)

12:45 PM - 1:00 PM MT | Welcome | Reuben Tan
1:00 PM - 1:30 PM MT | Invited Talk | Manling Li (Title: TODO)
1:30 PM - 2:00 PM MT | Invited Talk | Chelsea Finn (Title: TODO)
2:00 PM - 2:30 PM MT | Workshop Paper Presentations
2:30 PM - 3:00 PM MT | Afternoon Break and Poster Session (Poster boards: TODO)
3:00 PM - 3:30 PM MT | Invited Talk | Xiaolong Wang (Title: TODO)
3:30 PM - 4:00 PM MT | Invited Talk | Mohit Bansal (Title: TODO)
4:00 PM - 4:30 PM MT | Invited Talk | Kate Saenko (Title: TODO)
4:30 PM - 5:00 PM MT | Panel Discussion + Closing Remarks | Moderator: Jianfeng Gao

Talk abstracts and speaker bios: TODO

Call for Papers

We welcome original contributions that advance the state of the art in vision-language learning, multimodal perception, and embodied AI, particularly in unconstrained, real-world environments. Topics of interest include, but are not limited to:

  • LMMs & Vision-Language Systems: Open-vocabulary learning, multimodal pretraining, and adaptation.
  • Video Understanding & Temporal Reasoning: Long-range video modeling, causal reasoning, and instruction-following.
  • VLA & Embodied AI: Multimodal action learning, simulation-to-real transfer, and robotic perception.
  • Foundation Models for Vision Tasks: Object detection, segmentation, tracking, and fine-grained recognition in the wild.
  • Efficient Training Methods: Adaptation methods for large vision models, measured by the number of training samples (zero-shot and few-shot), number of trainable parameters, throughput, and training cost.
  • New Metrics and Benchmarks: Novel ways to evaluate existing LMMs and large vision models for task-level transfer and open-set visual recognition.

We accept abstract submissions to our workshop. Submissions may have at most 8 pages (excluding references) and should follow the CVPR 2026 author guidelines. All submissions will be reviewed by the Program Committee on the basis of technical quality, relevance to the scope of the workshop, originality, significance, and clarity. The review process is double-blind, and accepted papers are NOT archived in the CVPR 2026 proceedings.



Workshop Paper Submission Portal: [Open Review]
Submission Deadline: May 16th, 2026
Acceptance Notification: May 23rd, 2026
Camera-ready Submission: May 30th, 2026

For more information about the paper submission, please reach out to the workshop organizers.


Call for Challenge Submissions

We introduce two new challenges to evaluate the performance of large vision models in the wild:

Challenge   | Task                                      | Eval Metrics              | Instructions / Submission
MindCube    | Spatial Mental Model Reasoning            | Accuracy                  | Leaderboard
SITE-Bench  | Spatial Intelligence Thorough Evaluation  | Chance-Adjusted Accuracy  | Leaderboard
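For participants unfamiliar with chance-adjusted accuracy, one common convention is to rescale raw accuracy so that random guessing maps to 0 and perfect accuracy maps to 1. The sketch below illustrates that convention only; it is our assumption, not the official SITE-Bench scoring code, and the function name and signature are hypothetical. Please consult the challenge website for the exact evaluation script.

```python
# Minimal sketch (not the official scoring code): chance-adjusted accuracy for
# multiple-choice evaluation, assuming the common "rescale so chance = 0" convention.

def chance_adjusted_accuracy(correct: list[bool], num_options: list[int]) -> float:
    """correct[i]: whether question i was answered correctly;
    num_options[i]: number of answer choices for question i."""
    assert correct and len(correct) == len(num_options)
    raw = sum(correct) / len(correct)                         # plain accuracy
    chance = sum(1 / k for k in num_options) / len(correct)   # expected accuracy of random guessing
    return (raw - chance) / (1 - chance)                      # 0 = chance level, 1 = perfect

# Example: three 4-way questions, two answered correctly -> (2/3 - 0.25) / 0.75 ≈ 0.556
print(chance_adjusted_accuracy([True, True, False], [4, 4, 4]))
```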


June 1st, 2026: Competition ends
June 6th, 2026: Invitation to present at workshop

For more information about the challenge benchmarks, please visit the challenge websites and reach out to the challenge organizers.



Workshop Organizers



Reuben Tan
Microsoft



Zhengyuan Yang
Microsoft



Jianwei Yang
xAI



Jiasen Lu
Apple



Baolin Peng
Microsoft



Hao Cheng
Microsoft



Qianhui Wu
Microsoft



Oier Mees
Microsoft



Marc Pollefeys
ETH Zurich / Microsoft



Yong Jae Lee
UW Madison



Lijuan Wang
Microsoft



Jianfeng Gao
Microsoft


Challenge Organizers



Manling Li
Northwestern University



Qineng Wang
Northwestern University



Boqing Gong
Boston University



Wenqi Wang
Boston University


Questions? Reach out to the Workshop Organizing Team.