ALFRED Leaderboard

ALFRED (Action Learning From Realistic Environments and Directives) is a new benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. It includes long, compositional rollouts with non-reversible state changes to shrink the gap between research benchmarks and real-world applications.

Human Performance
Unseen Success Rate: 0.9100
Unseen PLWSR: 0.8580
Unseen GC: 0.9450
Unseen PLW GC Success Rate: 0.8760
(Leaderboard table: Rank, Submission, Created, and the Success Rate, PLWSR, GC, and PLW GC Success Rate metrics, each reported for the Seen and Unseen test splits.)

Getting Started with ALFRED

Getting the Data

See this guide for downloading the dataset, running models, and data augmentation. Also see the paper for a description of the challenge.

Scoring

We use a model-agnostic evaluation process to measure the performance of your trained agent. See the guide for running a model on the test sets to create a JSON dump of action-sequences executed by the agent. These action-sequences are replayed on the leaderboard server to compute the performance metrics.

We will report the following metrics, for the Seen and Unseen splits separately:

  - Success Rate (SR)
  - Goal-Condition Success Rate (GC)
  - Path-Length-Weighted Success Rate (PLWSR)
  - Path-Length-Weighted Goal-Condition Success Rate (PLW GC)
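
As a worked reference for the path-length weighting described in the paper: a per-episode score s is scaled by L* / max(L*, L_hat), where L* is the length of the expert demonstration and L_hat is the length of the agent's action sequence. The sketch below is only illustrative; the leaderboard server's replay is the authoritative implementation, and the function and argument names here are our own.

    def path_length_weighted(score, expert_len, agent_len):
        # Scale a score by L* / max(L*, L_hat), per the ALFRED paper's PLW metrics.
        return score * expert_len / max(expert_len, agent_len)

    def episode_metrics(goal_conditions_met, goal_conditions_total, expert_len, agent_len):
        # Per-episode metrics: SR is 1 only if every goal condition is satisfied;
        # GC is the fraction of goal conditions satisfied.
        success = 1.0 if goal_conditions_met == goal_conditions_total else 0.0
        goal_condition = goal_conditions_met / goal_conditions_total
        return {
            "SR": success,
            "GC": goal_condition,
            "PLWSR": path_length_weighted(success, expert_len, agent_len),
            "PLW_GC": path_length_weighted(goal_condition, expert_len, agent_len),
        }

    # Example: 2 of 3 goal conditions met, in an 80-step rollout vs. a 50-step expert demo.
    print(episode_metrics(2, 3, expert_len=50, agent_len=80))
    # -> SR 0.0, GC ~0.667, PLWSR 0.0, PLW_GC ~0.417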

Predictions Format

Run your model on the Test Seen and Test Unseen splits, and create an action-sequence dump of your agent:

$ python models/eval/leaderboard.py --model_path <model_path>/best_seen.pth --model models.model.seq2seq_im_mask --data data/json_feat_2.1.0 --gpu --num_threads 5

This will create a JSON file, e.g. task_results_20191218_081448_662435.json, inside the <model_path> folder. Email this file to askforalfred@googlegroups.com, preferably as a link to a storage platform such as Google Drive or Dropbox.
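
Before emailing the dump, it can help to sanity-check that the file parses and contains results for both test splits. The snippet below is a minimal sketch: the split key names ('tests_seen', 'tests_unseen') are an assumption on our part, so adapt them to whatever keys your dump actually contains.

    import json

    # Replace with the task_results_*.json created in your <model_path> folder.
    with open("task_results_20191218_081448_662435.json") as f:
        results = json.load(f)

    print("top-level keys:", sorted(results))

    # Assumed split keys; adjust to match your dump.
    for split in ("tests_seen", "tests_unseen"):
        entries = results.get(split, [])
        print(split, "->", len(entries), "action sequences")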

Rules

  1. Try to solve the ALFRED dataset: all submissions must be attempts to solve the ALFRED dataset.
  2. Do not exploit the metadata in test scenes: you should solve the vision-language grounding problem without misusing metadata from THOR. You may only use RGB frames and language instructions (goal & step-by-step) as input to your agent. You may not use additional depth, masks, or other metadata from the simulator on the Test Seen and Test Unseen scenes. During training, however, you may use additional information for auxiliary losses etc.
  3. Answer the following questions: (a) Did you use additional sensory information from THOR as input at test time, e.g. depth, segmentation masks, class masks, panoramic images, etc.? If so, please report it. (b) Did you use the alignments between step-by-step instructions and expert action-sequences for training or testing? (no by default; the instructions are serialized into a single sentence)
  4. Share who you are: you must provide a team name and affiliation.
  5. (Optional) Share how you solved it: if possible, share information about how the task was solved. Link an academic paper or code repository if public.
  6. Only submit your own work: you may evaluate any model on the validation set, but must only submit your own work for evaluation against the test set.

Example Models

We provide a pre-trained Seq2Seq+PM (Both) model described in the paper.

Submitting to the Leaderboard

Only one submission is allowed every 7 days. All submissions will be made public. Please do not create anonymous emails for multiple submissions. Use the val set to iterate on your agent.
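
For local iteration, the ALFRED repo includes a validation-split evaluation script. The invocation below mirrors the leaderboard command above, but the script path and flags are assumptions on our part; check the repo's evaluation guide for the authoritative command.

$ python models/eval/eval_seq2seq.py --model_path <model_path>/best_seen.pth --model models.model.seq2seq_im_mask --data data/json_feat_2.1.0 --eval_split valid_seen --gpu --num_threads 5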

Getting Help

If you need any help, please email us at askforalfred@googlegroups.com and mention that you are asking about the ALFRED leaderboard. Please include a submission URL if you are asking about a specific submission.
