Generalist Language Grounding Agents Challenge
Embodied AI Workshop @ CVPR 2023
Recent embodied agents have learned navigation and interaction skills from large-scale datasets, but progress has been limited to single-setting domains such as instruction-following or dialogue-driven tasks. To avoid over-specializing models to specific datasets and tasks, this challenge encourages the development of generalist language grounding agents whose architectures transfer language-understanding and decision-making capabilities across tasks. For this first iteration, we unify aspects of the ALFRED and TEACh datasets. While both datasets are set in the AI2THOR simulator, they differ along several axes:
- Declarative (ALFRED) vs. dialogue (TEACh) language introduces grounding and alignment challenges
- Different agent heights change depth-estimation and segmentation pipelines
- Differences in action spaces and room layouts require world-model generalization
Challenge Environment
Overview
The focus of this challenge is to build generalist embodied agents that map language to actions in embodied settings. Specifically, we want agents to be capable of solving instruction-following and dialogue-driven grounding tasks. These tasks involve challenges like partial observability, continuous state spaces, and irrevocable actions in rich visual environments. Such challenges are not captured by prior datasets for embodiment [1, 2, 3].
Key Topics
- Egocentric and Robotic Vision
- Language Grounding
- Dialogue-Driven Grounding
- Navigation and Path Planning
- Interactive/Causal Reasoning
- Learning from Demonstration
- Task and Symbolic Planning
- Deep Reinforcement Learning
- Commonsense Reasoning
Important Dates
| Timeline | Date |
|---|---|
| Challenge Opens | Mar 12 |
| Challenge Closes | Jun 12 (AoE) |
| Announcement | Jun 17 |
Challenge Details
Participants will submit to the two leaderboards independently, but both submissions must come from a single agent that is evaluated on both ALFRED and TEACh. The agent should share some weights/modules at the bottleneck, while the input encoders and output decoders can be customized for each environment. The top two submissions will have the opportunity to present their methods at the Embodied AI workshop.
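To make the shared-bottleneck requirement concrete, here is a minimal sketch (PyTorch, with hypothetical module choices and sizes; it is not a required or reference design) of benchmark-specific encoders and decoders around a single shared core:

```python
import torch
import torch.nn as nn

class GeneralistAgent(nn.Module):
    """Illustrative shared-bottleneck agent: per-benchmark encoders/decoders
    around a single shared core (modules and sizes are hypothetical)."""

    def __init__(self, hidden=512, alfred_actions=12, teach_actions=16):
        super().__init__()
        # Benchmark-specific input encoders (fused language + vision features).
        self.encoders = nn.ModuleDict({
            "alfred": nn.Linear(1024, hidden),
            "teach": nn.Linear(1024, hidden),
        })
        # Shared bottleneck: these weights are reused for both ALFRED and TEACh.
        self.shared_core = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Benchmark-specific action decoders.
        self.decoders = nn.ModuleDict({
            "alfred": nn.Linear(hidden, alfred_actions),
            "teach": nn.Linear(hidden, teach_actions),
        })

    def forward(self, features, benchmark="alfred"):
        x = self.encoders[benchmark](features)  # encode per benchmark
        x = self.shared_core(x)                 # shared reasoning core
        return self.decoders[benchmark](x)      # decode per benchmark

agent = GeneralistAgent()
feats = torch.randn(1, 10, 1024)                       # dummy fused features
alfred_logits = agent(feats, benchmark="alfred")
teach_logits = agent(feats, benchmark="teach")
```

Any architecture satisfying the rule is acceptable; the point is only that some weights in the middle are shared across both benchmarks.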
Submission
Email your best unseen success rates from ALFRED and TEACh to Mohit. Please include (1) the list of participants, (2) the team name, and (3) the model name. ALFRED agents should be submitted to the leaderboard; TEACh agents can be evaluated locally.
1️⃣ ALFRED Challenge
Guidelines
Participants are required to upload their model to our evaluation server with `[EAI23]` in the submission title, e.g. `[EAI23] Seq2seq Model`. The evaluation server automatically evaluates the models on an unseen test set. Final numbers for the challenge will be frozen on Jun 12 (AoE). Winning submissions will be required to submit a brief (private) report of technical details for validity checking. We will also conduct a quick code inspection to ensure that the challenge rules weren't violated (e.g. using additional info from test scenes).
Dataset
The challenge is based on the ALFRED Dataset, which contains 25K language annotations of both high-level goals and low-level step-by-step instructions for various tasks set in the AI2THOR simulator. Agents interact with environments through discrete actions and pixelwise masks.
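For intuition only, one timestep of an ALFRED-style agent can be viewed as a discrete action plus an optional pixelwise mask that grounds the target object; the sketch below is illustrative and does not reproduce the official ALFRED interface:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class AgentStep:
    """One predicted timestep in an ALFRED-style episode (illustrative only)."""
    action: str                        # e.g. "MoveAhead", "PickupObject"
    mask: Optional[np.ndarray] = None  # HxW binary mask for interaction actions

# Navigation actions need no mask; interaction actions ground the target
# object with a pixelwise mask predicted from the egocentric RGB frame.
trajectory = [
    AgentStep("MoveAhead"),
    AgentStep("RotateRight"),
    AgentStep("PickupObject", mask=np.zeros((300, 300), dtype=bool)),
]
```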
Starter Code
Check out the FILM repository by So Yeon Min et al.
Evaluation
The leaderboard script records actions taken by a pre-trained agent and dumps them to a JSON file. These deterministic actions in the JSON will be replayed on the leaderboard server for evaluation. This process is model-agnostic, allowing you to use your local resources for test-time inference.
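The sketch below only illustrates the general idea of dumping deterministic per-episode action traces to JSON for server-side replay; the authoritative schema is whatever the leaderboard script in the ALFRED repo writes, and the field and file names here are hypothetical:

```python
import json

# Hypothetical structure: per-episode action traces keyed by a trial id.
# The authoritative format is whatever the ALFRED leaderboard script produces.
results = {
    "tests_unseen": [
        {
            "trial_T00000000_000000": [  # hypothetical trial id
                {"action": "LookDown_15", "mask": None},
                {"action": "MoveAhead_25", "mask": None},
                {"action": "PickupObject", "mask": "<encoded pixelwise mask>"},
            ]
        }
    ],
}

with open("action_sequences_dump.json", "w") as f:
    json.dump(results, f)
```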
Metric
The submissions will be ranked by Unseen Success Rate.
Rules
- Include `[EAI23]` in the submission title, e.g. `[EAI23] Seq2seq Model`.
- The agent should share some weights/modules at the bottleneck, while the input encoders and output decoders can be customized for each environment.
- Do not exploit the metadata in test scenes: you should solve the vision-language grounding problem without misusing the metadata from THOR. For leaderboard evaluations, agents should use only RGB input and language instructions (goal & step-by-step). You cannot use additional depth, mask, or metadata info from the simulator on Test Seen and Test Unseen scenes. Submissions that use additional info on test scenes will be disqualified. However, during training you are allowed to use additional info, e.g. for auxiliary losses.
- During evaluation, agents are restricted to `max_steps=1000` and `max_fails=10`. Do not change these settings in the leaderboard script; such modifications will not be reflected on the evaluation server. (A minimal sketch of an evaluation loop that respects these limits appears after this list.)
- You can publish your results on the leaderboard only once every 7 days.
- Do not spam the leaderboard with repeated submissions (under different email accounts) in order to optimize on the test set. Fine-tuning should be done only on the validation set. Violators will be disqualified from the challenge.
- Try to solve the ALFRED dataset: all submissions must be attempts to solve the ALFRED dataset.
- Answer the following questions:
  a. Did you use additional sensory information from THOR as input, e.g. depth, segmentation masks, class masks, panoramic images, etc. during test time? If so, please report it.
  b. Did you use the alignments between step-by-step instructions and expert action sequences for training or testing? (No by default; the instructions are serialized into a single sentence.)
- Share who you are: you must provide a team name and affiliation.
- (Optional) Share how you solved it: if possible, share information about how the task was solved. Link an academic paper or code repository if public.
- Only submit your own work: you may evaluate any model on the validation set, but must only submit your own work for evaluation against the test set.
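As referenced in the rules above, here is a minimal, purely illustrative sketch of an evaluation loop that respects the step and failure budgets; `env`, `agent`, and their methods are hypothetical stand-ins, and the real limits are enforced by the leaderboard code itself:

```python
MAX_STEPS = 1000  # mirrors max_steps in the leaderboard script; do not change it there
MAX_FAILS = 10    # mirrors max_fails in the leaderboard script; do not change it there

def run_episode(env, agent, instruction):
    """Roll out one episode under the challenge's step/failure budgets.
    `env` and `agent` are hypothetical stand-ins, not the real ALFRED classes."""
    obs = env.reset()
    steps, fails = 0, 0
    while steps < MAX_STEPS and fails < MAX_FAILS:
        action, mask = agent.predict_action(obs, instruction)
        if action == "Stop":
            break
        obs, api_success = env.step(action, mask)
        steps += 1
        if not api_success:
            fails += 1  # failed API actions count toward the failure budget
    return env.task_success()
```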
2️⃣ TEACh Challenge
Guidelines
The TEACh leaderboard is not ready for the challenge. Participants are required to run evaluations locally on the Task from Dialogue (TfD) `val_unseen` split and report scores in the submission email to Mohit. Final numbers for the challenge will be frozen on Jun 12 (AoE). Winning submissions will be required to submit a brief (private) report of technical details for validity checking. We will also conduct a quick code inspection to ensure that the challenge rules weren't violated (e.g. using additional info from evaluation scenes).
Dataset
The challenge is based on the TEACh Dataset, which contains 3,000 human-to-human interactive dialogues. In these dialogues, a Commander with access to oracle task information communicates in natural language with a Follower, who acts in the environment to complete the task.
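To make the setup concrete, the sketch below shows roughly what a Task from Dialogue (TfD) episode pairs together: the dialogue history and the Follower's action sequence. Field names and values are illustrative only; the official schema is defined in the TEACh repo:

```python
# Illustrative shape of a TfD episode (NOT the official TEACh schema).
episode = {
    "task": "Make a cup of coffee",
    "dialogue": [
        {"speaker": "Commander", "utterance": "Today you will make coffee."},
        {"speaker": "Follower", "utterance": "Where is the mug?"},
        {"speaker": "Commander", "utterance": "Check the sink."},
    ],
    "follower_actions": [
        {"action": "Forward"},
        {"action": "Pickup", "object": "Mug"},
    ],
}
# A TfD agent sees the dialogue (and egocentric RGB) and must predict
# the Follower's action sequence that completes the task.
```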
Starter Code
Check out the Episodic Transformer (Pashevich et al.) implementation in the TEACh repo.
Metric
The submissions will be ranked by Unseen Success Rate.
Rules
- Report scores on the Task from Dialogue (TfD) `val_unseen` split. See this example for evaluating a trained agent.
- Do not exploit the metadata in evaluation scenes: you should solve the vision-language grounding problem without misusing the metadata from THOR. For challenge evaluations, agents should use only RGB input and dialogue interactions. You cannot use additional depth, mask, or metadata info from the simulator on the `val_unseen` evaluation split. Submissions that use additional info on evaluation scenes will be disqualified. However, during training you are allowed to use additional info, e.g. for auxiliary losses.
- Only submit your own work: you may evaluate any model on the validation set, but must only report your own work for the challenge evaluation.
Evaluation Metric
Submissions will be ranked by a combined score that equally weighs the Unseen Success Rates from ALFRED and TEACh.
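Concretely, the combined score is the unweighted mean of the two Unseen Success Rates; for example, the ECLAIR entry below gets (50.4 + 13.2) / 2 = 31.8. A minimal sketch:

```python
def combined_score(alfred_unseen_sr: float, teach_unseen_sr: float) -> float:
    """Equal-weight average of the ALFRED and TEACh Unseen Success Rates."""
    return (alfred_unseen_sr + teach_unseen_sr) / 2

print(round(combined_score(50.4, 13.2), 1))  # 31.8, matching the ECLAIR entry below
```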
Submissions

| Rank | Model | ALFRED Unseen Success | ALFRED Unseen PLW | TEACh Unseen Success | TEACh Unseen PLW | Combined |
|---|---|---|---|---|---|---|
| | Human Performance, University of Washington (Shridhar et al. '20) | 91.0 | - | - | - | - |
| 🥇 11 June, 2023 | ECLAIR, Yonsei University | 50.4 | - | 13.2 | - | 31.8 |
| | Seq2seq Baseline, University of Washington, Amazon, USC Viterbi | - | - | - | - | - |
FAQ
- Do we need to submit a report?
Winning submissions will be required to submit a brief (private) report of technical details for validity checking. Also consider submitting a workshop paper to EAI; see the submission guidelines for EAI.
- Do we need to submit a video?
The top two winning submissions will need to submit a brief video explaining their methods and results. These videos will be featured on this website and during the EAI workshop.
- Is there a prize for the winner?
TBA