Cornell NLVR

Cornell Natural Language Visual Reasoning (NLVR) is a language grounding dataset. It contains 92,244 pairs, each consisting of a natural language statement grounded in a synthetic image. The task is to determine whether the statement is true or false about the image. The data was collected through crowdsourcing and requires reasoning about sets of objects, quantities, comparisons, and spatial relations.

Have questions? Please visit our GitHub issues page or contact Alane Suhr (suhr < at > …).

To keep up to date with major updates, please subscribe.


"There is exactly one black triangle not touching any edge" (label: true)

"there is at least one tower with four blocks with a yellow block at the base and a blue block below the top block" (label: true)

"There is a box with multiple items and only one item has a different color." (label: false)

"There is exactly one tower with a blue block at the base and yellow block at the top" (label: false)

More examples (from the development set) are available here.


The data is split into training, development, and two test sets. The first test set (Test-P) is public and available with the data; the second (Test-U) will not be released. The leaderboard shows accuracy for the development and public test sets, as well as accuracy and consistency for the unreleased test set. The ranking in the leaderboards below is based on accuracy on the unreleased test set.
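The two leaderboard metrics can be illustrated with a short sketch. This page does not define consistency; the code below assumes the definition commonly used for NLVR, namely the proportion of unique sentences for which the prediction is correct on every image paired with that sentence. The dictionary keys (`sentence`, `image`, `label`, `prediction`) and the image names are hypothetical and do not reflect the official file format. Reported numbers should come from the provided evaluation scripts.

```python
from collections import defaultdict


def accuracy(examples):
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(1 for ex in examples if ex["prediction"] == ex["label"])
    return correct / len(examples)


def consistency(examples):
    """Fraction of distinct sentences predicted correctly on all paired images.

    Assumed definition: a sentence counts as consistent only if every
    example that shares that sentence is predicted correctly.
    """
    outcomes_by_sentence = defaultdict(list)
    for ex in examples:
        outcomes_by_sentence[ex["sentence"]].append(ex["prediction"] == ex["label"])
    consistent = sum(1 for outcomes in outcomes_by_sentence.values() if all(outcomes))
    return consistent / len(outcomes_by_sentence)


# Toy data: two sentences, each paired with two hypothetical images.
s1 = "There is exactly one black triangle not touching any edge"
s2 = "There is a box with multiple items and only one item has a different color."
examples = [
    {"sentence": s1, "image": "img_a", "label": True, "prediction": True},
    {"sentence": s1, "image": "img_b", "label": False, "prediction": True},
    {"sentence": s2, "image": "img_c", "label": False, "prediction": False},
    {"sentence": s2, "image": "img_d", "label": True, "prediction": True},
]

print(accuracy(examples))     # 3 of 4 examples correct
print(consistency(examples))  # 1 of 2 sentences correct on all its images
```

Under this definition consistency is the stricter measure, since one wrong prediction for a sentence makes the whole sentence count as inconsistent. That would explain why the consistency figures in the tables below are well under the accuracy figures.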

Instructions for running on the unreleased test set

To avoid overfitting and degrading the leaderboard held-out test set, we require two months or more between runs on the leaderboard test set. We will do our best to run within two weeks, and usually much sooner. We will only post results on the leaderboard when an online description of the system is available. Testing on the leaderboard test set is meant to be the final step before publication. Under extreme circumstances, we reserve the right to limit running on the leaderboard test set to systems that are mature for publication. Your model should generate a prediction file in the format specified in the NLVR readme and run with the provided evaluation scripts.

Please contact Alane Suhr if you wish to run on the unreleased test set.


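Date       | Model                                                                                         | Dev   | Test-P | Test-U (Acc) | Test-U (Cons)
-----------|-----------------------------------------------------------------------------------------------|-------|--------|--------------|--------------
2018.04.20 | CNN-BiATT (visual spatial filters with bidirectional matchings): Tan and Bansal 2018         | 66.9% | 69.7%  | 66.1%        | 28.9%
2017.04.22 | Neural Module Networks (Andreas et al. 2016), details in Suhr et al. 2017                    | 63.1% | 66.1%  | 62.0%        | -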

Structured Representations

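Date       | Model                                                                              | Dev   | Test-P | Test-U (Acc) | Test-U (Cons)
-----------|------------------------------------------------------------------------------------|-------|--------|--------------|--------------
2017.11.14 | AbsTAU (semantic parsing with example abstraction): Goldman et al. 2017            | 85.7% | 84.0%  | 82.5%        | 63.9%
2018.04.20 | BiATT-Pointer (object ordering with bidirectional matchings): Tan and Bansal 2018  | 74.6% | 73.9%  | 71.8%        | 37.2%
2017.04.22 | MaxEnt on sent+img features, details in Suhr et al. 2017                           | 68.0% | 67.7%  | 67.8%        | -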