NLVR

Cornell Natural Language for Visual Reasoning

The Natural Language for Visual Reasoning corpora are two language grounding datasets containing natural language sentences grounded in images. The task is to determine whether a sentence is true about a visual input. The data was collected through crowdsourcing, and solving the task requires reasoning about sets of objects, comparisons, and spatial relations. The release includes two corpora: NLVR, with synthetically generated images, and NLVR2, which includes natural photographs.


Natural Language for Visual Reasoning for Real

NLVR2 contains 107,292 examples of human-written English sentences grounded in pairs of photographs. NLVR2 retains the linguistic diversity of NLVR while including much more visually complex images.


Data Paper (Suhr et al. 2018)


The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing.
true

One image shows exactly two brown acorns in back-to-back caps on green foliage.
false

Image Credit (in order left-right, top-bottom): MemoryCatcher (CC0), Calabash13 (CC BY-SA 3.0), Charles Rondeau (CC0), Andale (CC0).


We publicly release only the sentence annotations, the original image URLs, and scripts that download the images from the URLs. If you would like direct access to the images, please fill out this Google Form. The form asks for your basic information and asks you to agree to our Terms of Service. We will get back to you within a week.
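As a rough illustration only (not the released download script), the sketch below reads a plain-text file with one image URL per line and saves each image locally; the file name url_list.txt and the output directory are hypothetical placeholders.

```python
# Minimal sketch of downloading images from a list of URLs.
# This is NOT the released download script; the file and directory
# names below are placeholders for illustration.
import os
import urllib.request

URL_FILE = "url_list.txt"   # hypothetical: one image URL per line
OUT_DIR = "images"

os.makedirs(OUT_DIR, exist_ok=True)

with open(URL_FILE) as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        # Use the last path component as the local file name.
        name = url.rsplit("/", 1)[-1]
        try:
            urllib.request.urlretrieve(url, os.path.join(OUT_DIR, name))
        except Exception as e:
            # Some URLs may no longer resolve; record the failure and move on.
            print(f"Failed to fetch {url}: {e}")
```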


Natural Language for Visual Reasoning

NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, structured representations of the images are also available, and the dataset can be used for semantic parsing.


Data Paper (Suhr et al. 2017)


There is exactly one black triangle not touching any edge
true

there is at least one tower with four blocks with a yellow block at the base and a blue block below the top block
true

There is a box with multiple items and only one item has a different color.
false

There is exactly one tower with a blue block at the base and yellow block at the top
false

More examples (from the development set) are available here.



Leaderboards

Both NLVR and NLVR2 are split into training, development, and two test sets. One test set is public (Test-P) and available with the data, and the other is not released (Test-U). We maintain a leaderboard displaying accuracy and consistency on the unreleased test set (as well as accuracy on the development and public test sets). Results are ordered by accuracy on the unreleased test set; ties are broken with consistency.


We require two months or more between runs on each leaderboard test set. We will do our best to run within two weeks (usually much sooner). We will only post results on the leaderboard when an online description of the system is available. Testing on the leaderboard test set is meant to be the final step before publication; under extreme circumstances, we reserve the right to limit runs on the leaderboard test set to systems that are mature enough for publication. Your model should generate a prediction file in the format specified in the NLVR readme and run with the provided evaluation scripts. You can request to add your model to the leaderboard even if you don't evaluate on the unreleased test set.


For both datasets, we use two evaluation metrics: accuracy and consistency. Accuracy (Acc) is computed as the proportion of examples (sentence-image pairs) for which a model correctly predicted a truth value. Consistency (Cons) measures the generalization of a model. It is computed as the proportion of unique sentences for which a model correctly predicted the truth value for all paired images (Goldman et al., 2018).
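As an illustrative sketch of how these two metrics can be computed (not the official evaluation script), the snippet below assumes each example is a (sentence_id, predicted_label, gold_label) triple, where all examples sharing a sentence_id pair the same sentence with different images; the field names are assumptions.

```python
from collections import defaultdict

def accuracy_and_consistency(examples):
    """Compute accuracy and consistency.

    `examples` is a list of (sentence_id, predicted_label, gold_label)
    triples; examples sharing a sentence_id pair the same sentence with
    different images. Field names are illustrative only; see the
    provided evaluation scripts for the official implementation.
    """
    correct = 0
    per_sentence = defaultdict(list)
    for sentence_id, pred, gold in examples:
        is_correct = (pred == gold)
        correct += is_correct
        per_sentence[sentence_id].append(is_correct)

    # Accuracy: proportion of sentence-image pairs predicted correctly.
    acc = correct / len(examples)
    # Consistency: proportion of unique sentences for which the truth
    # value was predicted correctly for all paired images.
    cons = sum(all(v) for v in per_sentence.values()) / len(per_sentence)
    return acc, cons
```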



Questions?

Please visit our GitHub issues page or email us at

nlvr < at > googlegroups.com

Please email us if you wish to run on an unreleased test set. To keep up to date with major changes, please subscribe to our mailing list.



Acknowledgments

This research was supported by the NSF (CRII-1656998), a Facebook ParlAI Research Award, an AI2 Key Scientific Challenges Award, an Amazon Cloud Credits Grant, and support from Women in Technology New York. This material is based on work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1650441. We thank Mark Yatskar and Noah Snavely for their comments and suggestions, and the workers who participated in our data collection for their contributions.


Also thanks to SQuAD for allowing us to use their code to create this website!

NLVR2 Leaderboard

NLVR2 presents the task of determining whether a natural language sentence is true about a pair of photographs.

Rank | Date | Model | Dev. (Acc) | Test-P (Acc) | Test-U (Acc) | Test-U (Cons)
- | - | Human Performance, Cornell University (Suhr et al. 2018) | 96.2 | 96.3 | 96.1 | -
1 | Nov 1, 2018 | MaxEnt, Cornell University (Suhr et al. 2018) | 54.1 | 54.8 | 53.5 | 12.0
2 | Nov 1, 2018 | CNN+RNN, Cornell University (Suhr et al. 2018) | 53.4 | 52.4 | 53.2 | 11.2
3 | Nov 1, 2018 | FiLM, MILA, run by Cornell University (Perez et al. 2018) | 51.0 | 52.1 | 53.0 | 10.6
4 | Nov 1, 2018 | Image Only (CNN), Cornell University (Suhr et al. 2018) | 51.6 | 51.9 | 51.9 | 7.1
5 | Nov 1, 2018 | N2NMN, policy search from scratch, UC Berkeley, run by Cornell University (Hu et al. 2017) | 51.0 | 51.1 | 51.5 | 5.0
6 | Nov 1, 2018 | Majority Class, Cornell University (Suhr et al. 2018) | 50.9 | 51.1 | 51.4 | 4.6
6 | Nov 1, 2018 | Text Only (RNN), Cornell University (Suhr et al. 2018) | 50.9 | 51.1 | 51.4 | 4.6*
8 | Nov 1, 2018 | MAC-Network, Stanford University, run by Cornell University (Hudson and Manning 2018) | 50.8 | 51.4 | 51.2 | 11.2

*The current arXiv version reports this as 4.8; this is a typo. It will be corrected as soon as we can update the arXiv version.

NLVR Leaderboard

NLVR presents the task of determining whether a natural language sentence is true about a synthetically generated image. We divide results by whether models process the image pixels directly (Images) or use the structured representations of the images (Structured Representations).

Images

Rank | Date | Model | Dev. (Acc) | Test-P (Acc) | Test-U (Acc) | Test-U (Cons)
- | - | Human Performance, Cornell University | 94.6 | 95.4 | 94.9 | -
1 | Apr 20, 2018 | CNN-BiATT, UNC Chapel Hill (Tan and Bansal 2018) | 66.9 | 69.7 | 66.1 | 28.9
2 | Nov 1, 2018 | N2NMN, policy search from scratch, UC Berkeley, run by Cornell University (Hu et al. 2017) | 65.3 | 69.1 | 66.0 | 17.7
3 | Apr 22, 2017 | Neural Module Networks, UC Berkeley, run by Cornell University (Andreas et al. 2016) | 63.1 | 66.1 | 62.0 | -
4 | Nov 1, 2018 | FiLM, MILA, run by Cornell University (Perez et al. 2018) | 60.1 | 62.2 | 61.2 | 18.1
5 | Apr 22, 2017 | Majority Class, Cornell University (Suhr et al. 2017) | 55.3 | 56.2 | 55.4 | -
6 | Nov 1, 2018 | MAC-Network, Stanford University, run by Cornell University (Hudson and Manning 2018) | 55.4 | 57.6 | 54.3 | 8.6
Unranked | Sept 7, 2018 | CMM, Chinese Academy of Sciences (Yao et al. 2018) | 68.0 | 69.9 | - | -
Unranked | May 23, 2018 | W-MemNN, Federico Santa María Technical University & Pontificia Universidad Católica de Valparaíso (Pavez et al. 2018) | 65.6 | 65.8 | - | -

Structured Representations

Rank | Date | Model | Dev. (Acc) | Test-P (Acc) | Test-U (Acc) | Test-U (Cons)
- | - | Human Performance, Cornell University | 94.6 | 95.4 | 94.9 | -
1 | Nov 14, 2017 | AbsTAU, Tel-Aviv University (Goldman et al. 2017) | 85.7 | 84.0 | 82.5 | 63.9
2 | Apr 4, 2018 | BiATT-Pointer, UNC Chapel Hill (Tan and Bansal 2018) | 74.6 | 73.9 | 71.8 | 37.2
3 | Apr 22, 2017 | MaxEnt, Cornell University (Suhr et al. 2017) | 68.0 | 67.7 | 67.8 | -
4 | Apr 22, 2017 | Majority Class, Cornell University (Suhr et al. 2017) | 55.3 | 56.2 | 55.4 | -