Generalist Language Grounding Agents Challenge
Embodied AI Workshop @ CVPR 2023
Recent embodied agents have learned navigation and interaction skills from large-scale datasets, but progress has been limited to single-setting domains such as instruction-following or dialogue-driven tasks. To avoid over-specializing models to specific datasets and tasks, this challenge encourages the development of generalist language grounding agents whose architectures transfer language-understanding and decision-making capabilities across tasks. For this first iteration, we unify aspects of the ALFRED and TEACh datasets. While both datasets are set in the AI2THOR simulator, they differ along several axes:
- Declarative (ALFRED) vs. dialogue (TEACh) language introduces grounding and alignment challenges
- Different agent heights change depth-estimation and segmentation pipelines
- Differences in action spaces and room layouts require world-model generalization
Challenge Environment
Overview
The focus of this challenge is to build generalist embodied agents that map language to actions in embodied settings. Specifically, we want agents to be capable of solving instruction-following and dialogue-driven grounding tasks. These tasks involve challenges like partial observability, continuous state spaces, and irrevocable actions in rich visual environments. Such challenges are not captured by prior datasets for embodiment [1, 2, 3].
Key Topics
- Egocentric and Robotic Vision
- Language Grounding
- Dialogue-Driven Grounding
- Navigation and Path Planning
- Interactive/Causal Reasoning
- Learning from Demonstration
- Task and Symbolic Planning
- Deep Reinforcement Learning
- Commonsense Reasoning
Important Dates
| Timeline | Date |
|---|---|
| Challenge Opens | Mar 12 |
| Challenge Closes | Jun 12 (AoE) |
| Announcement | Jun 17 |
Challenge Details
Participants will submit to the two leaderboards independently, but both submissions must come from a single agent that is evaluated on both ALFRED and TEACh. The agent should share some weights/modules at the bottleneck, while the input encoders and output decoders can be customized for each environment. The top two submissions will have the opportunity to present their methods at the Embodied AI workshop.
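To make the shared-bottleneck requirement concrete, here is a minimal sketch (PyTorch, with hypothetical module choices and sizes; it is not a required or reference design) of benchmark-specific encoders and decoders around a single shared core:

```python
import torch
import torch.nn as nn

class GeneralistAgent(nn.Module):
    """Illustrative shared-bottleneck agent: per-benchmark encoders/decoders
    around a single shared core (modules and sizes are hypothetical)."""

    def __init__(self, hidden=512, alfred_actions=12, teach_actions=16):
        super().__init__()
        # Benchmark-specific input encoders (fused language + vision features).
        self.encoders = nn.ModuleDict({
            "alfred": nn.Linear(1024, hidden),
            "teach": nn.Linear(1024, hidden),
        })
        # Shared bottleneck: these weights are reused for both ALFRED and TEACh.
        self.shared_core = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Benchmark-specific action decoders.
        self.decoders = nn.ModuleDict({
            "alfred": nn.Linear(hidden, alfred_actions),
            "teach": nn.Linear(hidden, teach_actions),
        })

    def forward(self, features, benchmark="alfred"):
        x = self.encoders[benchmark](features)  # encode per benchmark
        x = self.shared_core(x)                 # shared reasoning core
        return self.decoders[benchmark](x)      # decode per benchmark

agent = GeneralistAgent()
feats = torch.randn(1, 10, 1024)                       # dummy fused features
alfred_logits = agent(feats, benchmark="alfred")
teach_logits = agent(feats, benchmark="teach")
```

Any architecture satisfying the rule is acceptable; the point is only that some weights in the middle are shared across both benchmarks.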
Submission
Email your best unseen success rates from ALFRED and TEACh to Mohit. Please include (1) the list of participants, (2) the team name, and (3) the model name. ALFRED agents should be submitted to the leaderboard; TEACh agents can be evaluated locally.
1️⃣ ALFRED Challenge
Guidelines
Participants are required to upload their model to our evaluation server with `[EAI23]` in the submission title, e.g. `[EAI23] Seq2seq Model`. The evaluation server automatically evaluates the models on an unseen test set. Final numbers for the challenge will be frozen on Jun 12 (AoE). Winning submissions will be required to submit a brief (private) report of technical details for validity checking. We will also conduct a quick code inspection to ensure that the challenge rules weren't violated (e.g. using additional info from test scenes).
Dataset
The challenge is based on the ALFRED Dataset, which contains 25K language annotations of both high-level goals and low-level step-by-step instructions for various tasks set in the AI2THOR simulator. Agents interact with environments through discrete actions and pixelwise masks.
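For intuition only, one timestep of an ALFRED-style agent can be viewed as a discrete action plus an optional pixelwise mask that grounds the target object; the sketch below is illustrative and does not reproduce the official ALFRED interface:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class AgentStep:
    """One predicted timestep in an ALFRED-style episode (illustrative only)."""
    action: str                        # e.g. "MoveAhead", "PickupObject"
    mask: Optional[np.ndarray] = None  # HxW binary mask for interaction actions

# Navigation actions need no mask; interaction actions ground the target
# object with a pixelwise mask predicted from the egocentric RGB frame.
trajectory = [
    AgentStep("MoveAhead"),
    AgentStep("RotateRight"),
    AgentStep("PickupObject", mask=np.zeros((300, 300), dtype=bool)),
]
```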
Starter Code
Check out the FILM repository by So Yeon Min et al.
Evaluation
The leaderboard script records actions taken by a pre-trained agent and dumps them to a JSON file. These deterministic actions in the JSON will be replayed on the leaderboard server for evaluation. This process is model-agnostic, allowing you to use your local resources for test-time inference.
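The sketch below only illustrates the general idea of dumping deterministic per-episode action traces to JSON for server-side replay; the authoritative schema is whatever the leaderboard script in the ALFRED repo writes, and the field and file names here are hypothetical:

```python
import json

# Hypothetical structure: per-episode action traces keyed by a trial id.
# The authoritative format is whatever the ALFRED leaderboard script produces.
results = {
    "tests_unseen": [
        {
            "trial_T00000000_000000": [  # hypothetical trial id
                {"action": "LookDown_15", "mask": None},
                {"action": "MoveAhead_25", "mask": None},
                {"action": "PickupObject", "mask": "<encoded pixelwise mask>"},
            ]
        }
    ],
}

with open("action_sequences_dump.json", "w") as f:
    json.dump(results, f)
```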
Metric
The submissions will be ranked by Unseen Success Rate.
Rules
- Include `[EAI23]` in the submission title, e.g. `[EAI23] Seq2seq Model`.
- The agent should share some weights/modules at the bottleneck, while the input encoders and output decoders can be customized for each environment.
- Do not exploit the metadata in test scenes: you should solve the vision-language grounding problem without misusing the metadata from THOR. For leaderboard evaluations, agents should use only RGB input and language instructions (goal & step-by-step). You cannot use additional depth, mask, or metadata info from the simulator on Test Seen and Test Unseen scenes. Submissions that use additional info on test scenes will be disqualified. However, during training you are allowed to use additional info, e.g. for auxiliary losses.
- During evaluation, agents are restricted to `max_steps=1000` and `max_fails=10`. Do not change these settings in the leaderboard script; such modifications will not be reflected on the evaluation server. (A minimal sketch of an evaluation loop that respects these limits appears after this list.)
- You can publish your results on the leaderboard only once every 7 days.
- Do not spam the leaderboard with repeated submissions (under different email accounts) in order to optimize on the test set. Fine-tuning should be done only on the validation set. Violators will be disqualified from the challenge.
- Try to solve the ALFRED dataset: all submissions must be attempts to solve the ALFRED dataset.
- Answer the following questions:
  a. Did you use additional sensory information from THOR as input, e.g. depth, segmentation masks, class masks, panoramic images, etc. during test time? If so, please report it.
  b. Did you use the alignments between step-by-step instructions and expert action sequences for training or testing? (No by default; the instructions are serialized into a single sentence.)
- Share who you are: you must provide a team name and affiliation.
- (Optional) Share how you solved it: if possible, share information about how the task was solved. Link an academic paper or code repository if public.
- Only submit your own work: you may evaluate any model on the validation set, but must only submit your own work for evaluation against the test set.
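As referenced in the rules above, here is a minimal, purely illustrative sketch of an evaluation loop that respects the step and failure budgets; `env`, `agent`, and their methods are hypothetical stand-ins, and the real limits are enforced by the leaderboard code itself:

```python
MAX_STEPS = 1000  # mirrors max_steps in the leaderboard script; do not change it there
MAX_FAILS = 10    # mirrors max_fails in the leaderboard script; do not change it there

def run_episode(env, agent, instruction):
    """Roll out one episode under the challenge's step/failure budgets.
    `env` and `agent` are hypothetical stand-ins, not the real ALFRED classes."""
    obs = env.reset()
    steps, fails = 0, 0
    while steps < MAX_STEPS and fails < MAX_FAILS:
        action, mask = agent.predict_action(obs, instruction)
        if action == "Stop":
            break
        obs, api_success = env.step(action, mask)
        steps += 1
        if not api_success:
            fails += 1  # failed API actions count toward the failure budget
    return env.task_success()
```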
2️⃣ TEACh Challenge
Guidelines
The TEACh leaderboard is not ready for the challenge. Participants are required to run evaluations locally on the Task from Dialogue (TfD) `val_unseen` split and report scores in the submission email to Mohit. Final numbers for the challenge will be frozen on Jun 12 (AoE). Winning submissions will be required to submit a brief (private) report of technical details for validity checking. We will also conduct a quick code inspection to ensure that the challenge rules weren't violated (e.g. using additional info from evaluation scenes).
Dataset
The challenge is based on the TEACh Dataset, which contains 3,000 human-to-human interactive dialogues. In these dialogues, a Commander with access to oracle task information communicates in natural language with a Follower, who acts in the environment to complete the task.
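To make the setup concrete, the sketch below shows roughly what a Task from Dialogue (TfD) episode pairs together: the dialogue history and the Follower's action sequence. Field names and values are illustrative only; the official schema is defined in the TEACh repo:

```python
# Illustrative shape of a TfD episode (NOT the official TEACh schema).
episode = {
    "task": "Make a cup of coffee",
    "dialogue": [
        {"speaker": "Commander", "utterance": "Today you will make coffee."},
        {"speaker": "Follower", "utterance": "Where is the mug?"},
        {"speaker": "Commander", "utterance": "Check the sink."},
    ],
    "follower_actions": [
        {"action": "Forward"},
        {"action": "Pickup", "object": "Mug"},
    ],
}
# A TfD agent sees the dialogue (and egocentric RGB) and must predict
# the Follower's action sequence that completes the task.
```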
Starter Code
Check out the Episodic Transformer (Pashevich et al.) implementation in the TEACh repo.
Metric
The submissions will be ranked by Unseen Success Rate.
Rules
- Report scores on the Task from Dialogue (TfD) `val_unseen` split. See this example for evaluating a trained agent.
- Do not exploit the metadata in evaluation scenes: you should solve the vision-language grounding problem without misusing the metadata from THOR. For challenge evaluations, agents should use only RGB input and dialogue interactions. You cannot use additional depth, mask, or metadata info from the simulator on the `val_unseen` evaluation split. Submissions that use additional info on evaluation scenes will be disqualified. However, during training you are allowed to use additional info, e.g. for auxiliary losses.
- Only submit your own work: you may evaluate any model on the validation set, but must only report your own work for the challenge evaluation.
Evaluation Metric
Submissions will be ranked by a combined score that equally weighs the Unseen Success Rates from ALFRED and TEACh.
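Concretely, the combined score is the unweighted mean of the two Unseen Success Rates; for example, the ECLAIR entry below gets (50.4 + 13.2) / 2 = 31.8. A minimal sketch:

```python
def combined_score(alfred_unseen_sr: float, teach_unseen_sr: float) -> float:
    """Equal-weight average of the ALFRED and TEACh Unseen Success Rates."""
    return (alfred_unseen_sr + teach_unseen_sr) / 2

print(round(combined_score(50.4, 13.2), 1))  # 31.8, matching the ECLAIR entry below
```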
Submissions

| Rank | Model | ALFRED Unseen Success | ALFRED Unseen PLW | TEACh Unseen Success | TEACh Unseen PLW | Combined |
|---|---|---|---|---|---|---|
| | Human Performance, University of Washington (Shridhar et al. '20) | 91.0 | - | - | - | - |
| 🥇 11 June, 2023 | ECLAIR, Yonsei University | 50.4 | - | 13.2 | - | 31.8 |
| | Seq2seq Baseline, University of Washington, Amazon, USC Viterbi | - | - | - | - | - |
FAQ
- Do we need to submit a report?
Winning submissions will be required to submit a brief (private) report of technical details for validity checking. Also consider submitting a workshop paper to EAI; see the submission guidelines for EAI.
- Do we need to submit a video?
The top two winning submissions will need to submit a brief video explaining their methods and results. These videos will be featured on this website and during the EAI workshop.
- Is there a prize for the winner?
TBA