Cornell NLVR

Cornell Natural Language Visual Reasoning (NLVR) is a language grounding dataset. It contains 92,244 pairs, each consisting of a natural language statement grounded in a synthetic image. The task is to determine whether a sentence is true or false about an image. The data was collected through crowdsourcing, and solving the task requires reasoning about sets of objects, quantities, comparisons, and spatial relations.
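
Since each pair carries a true/false judgment, the task can be read as binary classification over (sentence, image) pairs. The following is a minimal Python sketch of one such pair. The class name `Example`, its field names, the `parse_label` helper, and the placeholder image identifier are all hypothetical and are not taken from the released data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One statement-image pair. All field names here are hypothetical."""
    sentence: str   # the natural language statement
    image_id: str   # placeholder identifier for the synthetic image
    label: bool     # True if the statement holds for the image


def parse_label(text: str) -> bool:
    """Map the strings 'true' / 'false' (any casing) to a bool."""
    normalized = text.strip().lower()
    if normalized not in ("true", "false"):
        raise ValueError(f"unexpected label: {text!r}")
    return normalized == "true"


example = Example(
    sentence="There is exactly one black triangle not touching any edge",
    image_id="example-0",  # placeholder, not a real identifier
    label=parse_label("true"),
)
```

A system for the task would then be any function that takes the sentence and the image and returns a boolean prediction to compare against `label`.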

Have questions? Please visit our GitHub issues page or contact Alane Suhr (suhr < at > cs.cornell.edu).


Examples

"There is exactly one black triangle not touching any edge" (label: true)

"there is at least one tower with four blocks with a yellow block at the base and a blue block below the top block" (label: true)

"There is a box with multiple items and only one item has a different color." (label: false)

"There is exactly one tower with a blue block at the base and yellow block at the top" (label: false)

More examples (from the development set) are available here.


Leaderboard

The data is split into training, development, and two test sets. The first test set is public and available with the data; the second will not be released. The ranking in the leaderboards below is based on results on the unreleased test set. We will soon post instructions for submitting systems to be tested on the unreleased data. In the meantime, if you are interested in testing your system, please contact us.
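
Scoring a system on each split can be sketched as follows. This assumes the percentages in the leaderboards are classification accuracy (the fraction of statements judged correctly), which the page does not state explicitly. The function name, the split names used as dictionary keys, and the toy labels are illustrative and are not taken from the dataset.

```python
from typing import Sequence


def accuracy(predictions: Sequence[bool], gold: Sequence[bool]) -> float:
    """Fraction of predictions that match the gold true/false labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold:
        raise ValueError("cannot score an empty split")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy labels for illustration only, not taken from the dataset.
gold_by_split = {
    "development": [True, False, True, True],
    "public_test": [False, False, True, True],
}
predictions_by_split = {
    "development": [True, True, True, True],
    "public_test": [False, True, True, False],
}

scores = {
    split: accuracy(predictions_by_split[split], gold)
    for split, gold in gold_by_split.items()
}
for split, score in scores.items():
    print(f"{split}: {score:.2%}")
```

Because the labels of the second test set are not released, only the development and public test scores could be computed locally in this way.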

Images

Date | Model | Development | Public Test | Unreleased Test
2017.04.22 | Neural Module Networks (Andreas et al. 2016), details in Suhr et al. 2017 | 63.06% | 66.12% | 61.99%

Structured Representations

Date | Model | Development | Public Test | Unreleased Test
2017.04.22 | MaxEnt on sent+img features, details in Suhr et al. 2017 | 68.04% | 67.68% | 67.82%