Generalist Language Grounding Agents Challenge
Embodied AI Workshop @ CVPR 2023
Recent embodied agents have been successful in learning navigation and interaction skills from large-scale datasets, but progress has been limited to single-setting domains such as instruction following or dialogue-driven tasks. To avoid over-specializing models to specific datasets and tasks, this challenge encourages the development of generalist language grounding agents whose architectures transfer language-understanding and decision-making capabilities across tasks. For this first iteration, we unify aspects of the ALFRED and TEACh datasets. While both datasets are set in the AI2THOR simulator, they differ along several axes:
- Declarative (ALFRED) vs. dialogue (TEACh) language introduces grounding and alignment challenges
- Different agent heights change depth estimation and segmentation pipelines
- Changes to the action spaces and room layouts require world-model generalization
The ALFRED Leaderboard is accepting submissions to the Embodied AI Challenge with [EAI23] tags! Humans have a success rate of 91% on unseen environments, but the best models are still far behind 😢
Can you do even better? Code, precomputed features, and the AI2THOR simulator are all available on GitHub. For a quick start, check out FILM, a SoTA agent.
The TEACh Leaderboard is coming soon! Submit your agent with the [EAI23] tag. The best-performing agents have an unseen success rate of <1% 😖
Can you do even better? Code and baselines are all available on GitHub.
The focus of this challenge is to build generalist embodied agents that map language to actions in embodied settings. Specifically, we want agents to be capable of solving instruction-following and dialogue-driven grounding tasks. These tasks involve challenges like partial observability, continuous state spaces, and irrevocable actions in rich visual environments. Such challenges are not captured by prior datasets for embodiment [1, 2, 3].
- Key Topics
- Egocentric and Robotic vision
- Language Grounding
- Dialogue-Driven Grounding
- Navigation and Path Planning
- Interactive/Causal Reasoning
- Learning from Demonstration
- Task and Symbolic Planning
- Deep Reinforcement Learning
- Commonsense Reasoning
| Event | Date |
|---|---|
| Challenge opens | Mar 12 |
| Leaderboard closes | Jun 12 |
| Winner announcement | Jun 17 |
Participants will submit to each leaderboard independently, but both submissions must come from a single agent that is evaluated on both ALFRED and TEACh. The top two submissions will have the opportunity to present their methods at the Embodied AI workshop.
1️⃣ ALFRED Challenge
Participants are required to upload their model to our evaluation server with [EAI23] in the submission title, e.g. [EAI23] Seq2seq Model. The evaluation server automatically evaluates the models on an unseen test set. Final numbers for the prize challenge will be frozen on Jun 12. Winning submissions will be required to submit a brief (private) report of technical details for validity checking. We will also conduct a quick code inspection to ensure that the challenge rules weren't violated (e.g. using additional info from test scenes).
The challenge is based on the ALFRED Dataset, which contains 25K language annotations of both high-level goals and low-level step-by-step instructions for various tasks set in the AI2THOR simulator. Agents interact with environments through discrete actions and pixelwise masks.
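The action-plus-mask interface can be sketched as follows. This is an illustrative helper only, not the actual THOR/ALFRED API: the function name and dictionary keys are hypothetical, and the 300×300 frame size is the resolution used by ALFRED's egocentric camera.

```python
# Hedged sketch: ALFRED agents act via a discrete action name plus a binary
# interaction mask over the egocentric frame. This helper is illustrative;
# it is not the real THOR/ALFRED API.
FRAME_H, FRAME_W = 300, 300

def make_interaction(action_name, mask):
    """Pair a discrete action with a pixelwise mask selecting the target object."""
    assert len(mask) == FRAME_H and all(len(row) == FRAME_W for row in mask)
    return {"action": action_name, "interact_mask": mask}

# Mark a 50x60 rectangle of pixels as the target object.
mask = [[100 <= r < 150 and 120 <= c < 180 for c in range(FRAME_W)]
        for r in range(FRAME_H)]
step = make_interaction("PickupObject", mask)
print(sum(sum(row) for row in step["interact_mask"]))  # 3000 masked pixels
```

Navigation actions (e.g. moving or turning) would omit the mask; interaction actions use it to disambiguate which visible object the agent is manipulating.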
Check out the FILM repository by So Yeon Min et al.
The leaderboard script records actions taken by a pre-trained agent and dumps them to a JSON file. These deterministic actions in the JSON will be replayed on the leaderboard server for evaluation. This process is model-agnostic, allowing you to use your local resources for test-time inference.
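The record-and-replay flow described above can be sketched as below. The actual leaderboard script defines its own JSON schema; the field names here ("task_id", "actions") are illustrative placeholders only.

```python
import json
import os
import tempfile

# Hedged sketch of dumping a deterministic action sequence to JSON so it can
# be replayed on the evaluation server. Not the real leaderboard schema.
def dump_trajectory(task_id, actions, path):
    """Record the agent's predicted actions for server-side replay."""
    with open(path, "w") as f:
        json.dump({"task_id": task_id, "actions": actions}, f)

# Demo: dump a short sequence locally, then read it back as the server would.
path = os.path.join(tempfile.gettempdir(), "traj.json")
dump_trajectory("example_task", [{"action": "MoveAhead"}], path)
with open(path) as f:
    replayed = json.load(f)
print(replayed["actions"][0]["action"])  # MoveAhead
```

Because the replay is deterministic, inference can run on your own hardware; only the resulting action sequences are uploaded.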
The submissions will be ranked by Unseen Success Rate.
- Tag your submission: include [EAI23] in the submission title, e.g. [EAI23] Seq2seq Model.
- Do not exploit the metadata in test scenes: you should solve the vision-language grounding problem without misusing the metadata from THOR. For leaderboard evaluations, agents must use only RGB input and language instructions (goal & step-by-step). You cannot use additional depth, mask, metadata info, etc. from the simulator on Test Seen and Test Unseen scenes. Submissions that use additional info on test scenes will be disqualified. However, during training you are allowed to use additional info, e.g. for auxiliary losses.
- During evaluation, agents are restricted to max_fails=10. Do not change this setting in the leaderboard script; such modifications will not be reflected on the evaluation server.
- You can publish your results on the leaderboard only once every 7 days.
- Do not spam the leaderboard with repeated submissions (under different email accounts) in order to optimize on the test set. Fine-tuning should be done only on the validation set. Violators will be disqualified from the challenge.
- Try to solve the ALFRED dataset: all submissions must be attempts to solve the ALFRED dataset.
- Answer the following questions:
  a. Did you use additional sensory information from THOR as input, e.g. depth, segmentation masks, class masks, panoramic images, etc. during test time? If so, please report it.
  b. Did you use the alignments between step-by-step instructions and expert action sequences for training or testing? (No by default; the instructions are serialized into a single sentence.)
- Share who you are: you must provide a team name and affiliation.
- (Optional) Share how you solved it: if possible, share information about how the task was solved. Link an academic paper or code repository if public.
- Only submit your own work: you may evaluate any model on the validation set, but must only submit your own work for evaluation against the test set.
2️⃣ TEACh Challenge
Details coming soon ...
📈 Evaluation Metric
Submissions will be ranked by a combined score that equally weighs the Unseen Success Rates from both ALFRED and TEACh:
University of Washington (Shridhar et al. '20)
University of Washington, Amazon, USC Viterbi
- Do we need to submit a report?
Winning submissions will be required to submit a brief (private) report of technical details for validity checking. Also consider submitting a workshop paper to EAI. See submission guidelines for EAI.
- Do we need to submit a video?
The top two winning submissions will need to submit a brief video explaining their methods and results. These videos will be featured on this website and during the EAI workshop.
- Is there a prize for the winner?