ALFRED Leaderboard

ALFRED (Action Learning From Realistic Environments and Directives) is a new benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. It includes long, compositional rollouts with non-reversible state changes to shrink the gap between research benchmarks and real-world applications.

Human Performance
Unseen Success Rate: 0.9100
Unseen PLWSR: 0.8580
Unseen GC: 0.9450
Unseen PLW GC Success Rate: 0.8760
(Leaderboard table: Rank, Submission, Created, and the Success Rate, PLWSR, GC, and PLW GC Success Rate metrics, each reported for the Seen and Unseen test splits.)

Getting Started with ALFRED

Getting the Data

See this guide for downloading the dataset, running models, and data augmentation. Also see the paper for a description of the challenge.

Scoring

We use a model-agnostic evaluation process to measure the performance of your trained agent. See the guide for running a model on the test sets to create a JSON dump of action-sequences executed by the agent. These action-sequences are replayed on the leaderboard server to compute the performance metrics.

We will report the following metrics, for the Seen and Unseen splits separately:

  - Success Rate (SR)
  - Goal-Condition Success Rate (GC)
  - Path-Length-Weighted Success Rate (PLWSR)
  - Path-Length-Weighted Goal-Condition Success Rate (PLW GC)
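
As a worked reference for the path-length weighting described in the paper: a per-episode score s is scaled by L* / max(L*, L_hat), where L* is the length of the expert demonstration and L_hat is the length of the agent's action sequence. The sketch below is only illustrative; the leaderboard server's replay is the authoritative implementation, and the function and argument names here are our own.

    def path_length_weighted(score, expert_len, agent_len):
        # Scale a score by L* / max(L*, L_hat), per the ALFRED paper's PLW metrics.
        return score * expert_len / max(expert_len, agent_len)

    def episode_metrics(goal_conditions_met, goal_conditions_total, expert_len, agent_len):
        # Per-episode metrics: SR is 1 only if every goal condition is satisfied;
        # GC is the fraction of goal conditions satisfied.
        success = 1.0 if goal_conditions_met == goal_conditions_total else 0.0
        goal_condition = goal_conditions_met / goal_conditions_total
        return {
            "SR": success,
            "GC": goal_condition,
            "PLWSR": path_length_weighted(success, expert_len, agent_len),
            "PLW_GC": path_length_weighted(goal_condition, expert_len, agent_len),
        }

    # Example: 2 of 3 goal conditions met, in an 80-step rollout vs. a 50-step expert demo.
    print(episode_metrics(2, 3, expert_len=50, agent_len=80))
    # -> SR 0.0, GC ~0.667, PLWSR 0.0, PLW_GC ~0.417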

Predictions Format

Run your model on the Test Seen and Test Unseen splits, and create an action-sequence dump of your agent:

$ python models/eval/leaderboard.py --model_path <model_path>/best_seen.pth --model models.model.seq2seq_im_mask --data data/json_feat_2.1.0 --gpu --num_threads 5

This will create a JSON file, e.g. task_results_20191218_081448_662435.json, inside the <model_path> folder. Email this file to askforalfred@googlegroups.com, preferably as a link to a storage platform such as Google Drive or Dropbox.
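
Before emailing the dump, it can help to sanity-check that the file parses and contains results for both test splits. The snippet below is a minimal sketch: the split key names ('tests_seen', 'tests_unseen') are an assumption on our part, so adapt them to whatever keys your dump actually contains.

    import json

    # Replace with the task_results_*.json created in your <model_path> folder.
    with open("task_results_20191218_081448_662435.json") as f:
        results = json.load(f)

    print("top-level keys:", sorted(results))

    # Assumed split keys; adjust to match your dump.
    for split in ("tests_seen", "tests_unseen"):
        entries = results.get(split, [])
        print(split, "->", len(entries), "action sequences")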

Rules

  1. Try to solve the ALFRED dataset: all submissions must be attempts to solve the ALFRED dataset.
  2. Do not exploit the metadata in test scenes: you should solve the vision-language grounding problem without misusing metadata from THOR. You may only use RGB frames and language instructions (goal & step-by-step) as input to your agent. You may not use additional depth, masks, or other metadata from the simulator on the Test Seen and Test Unseen scenes. During training, however, you may use additional information for auxiliary losses etc.
  3. Answer the following questions: (a) Did you use additional sensory information from THOR as input at test time, e.g. depth, segmentation masks, class masks, panoramic images, etc.? If so, please report it. (b) Did you use the alignments between step-by-step instructions and expert action-sequences for training or testing? (no by default; the instructions are serialized into a single sentence)
  4. Share who you are: you must provide a team name and affiliation.
  5. (Optional) Share how you solved it: if possible, share information about how the task was solved. Link an academic paper or code repository if public.
  6. Only submit your own work: you may evaluate any model on the validation set, but must only submit your own work for evaluation against the test set.

Example Models

We provide a pre-trained Seq2Seq+PM (Both) model described in the paper.

Submitting to the Leaderboard

Only one submission is allowed every 7 days. All submissions will be made public. Please do not create anonymous emails for multiple submissions. Use the val set to iterate on your agent.
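
For local iteration, the ALFRED repo includes a validation-split evaluation script. The invocation below mirrors the leaderboard command above, but the script path and flags are assumptions on our part; check the repo's evaluation guide for the authoritative command.

$ python models/eval/eval_seq2seq.py --model_path <model_path>/best_seen.pth --model models.model.seq2seq_im_mask --data data/json_feat_2.1.0 --eval_split valid_seen --gpu --num_threads 5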

Getting Help

If you need any help, please email us at askforalfred@googlegroups.com and mention that you are asking about the ALFRED leaderboard. Please include a submission URL if you are asking about a specific submission.
