TG-CSR

Theoretically-Grounded Commonsense Reasoning

OVERVIEW

Theoretically-Grounded Commonsense Reasoning (TG-CSR) is a benchmark for systematically evaluating machine commonsense in a few-shot question answering (QA) setting. We call the benchmark theoretically grounded because every question and candidate answer falls under a "representational area" in the Gordon-Hobbs theory of commonsense, such as "time", "space", and "emotions". The goal of the benchmark is to assess whether a single machine commonsense model, such as a language representation model, can answer questions across all of these categories with only a small amount of training data.

The task format of TG-CSR is multi-set QA. For each category, a set of roughly 10 candidate answers is provided and shared among all questions in that category. To make the task easier for large-scale models, we "pair" each question with a single candidate answer. The task then becomes binary QA: a prediction of "yes" (encoded as 1) indicates that the candidate answer is a good fit for the question, and "no" (encoded as 0) indicates that it is not.
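The Python sketch below illustrates this pairing scheme with a made-up question set and candidate pool; the field names and example content are hypothetical and do not reflect the actual TG-CSR schema, which is documented in the starting kit.

```python
# Illustrative only: field names and examples are hypothetical, not the
# actual TG-CSR data format (see the starting kit for the real schema).
from itertools import product

# One category (e.g., "time") with a candidate-answer set shared by all
# questions in that category.
questions = [
    "How long does the flight to the destination take?",
    "When should the traveler arrive at the airport?",
]
candidate_answers = [
    "about eight hours",
    "two hours before departure",
    "a suitcase",
]

# Pair every question with every shared candidate to obtain binary QA items.
# A model predicts 1 ("yes", the candidate fits the question) or 0 ("no")
# for each pair; gold labels are available only in the train/dev splits.
binary_items = [
    {"question": q, "candidate": a, "label": None}
    for q, a in product(questions, candidate_answers)
]

for item in binary_items[:3]:
    print(item)
```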

TG-CSR is a phased benchmark. The initial dataset consists of questions set in a "vacationing abroad" context, for which training, development, and test question sets are provided; more details on format and submission are given in the starting kit. TG-CSR is initially posed as a few-shot problem, but later phases will be zero-shot, with no training or development sets provided. In upcoming phases, we will release three additional question sets with the same format and Gordon-Hobbs representational areas, but designed around three other contexts.

DATASET ACCESS / SUBMISSION

TG-CSR has been launched on the CodaLab platform, where submissions can be made directly. The platform provides a starting kit, labeled training and development data (for the first release only; future phases will be purely zero-shot and include only unlabeled test data), and unlabeled test data. To enable preliminary exploration (since CodaLab requires signup), we are also making the labeled training and development set downloadable here.
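As a rough illustration of how predictions could be assembled for the binary QA task, the sketch below writes one 0/1 prediction per question-candidate pair. The file names, JSON layout, and the placeholder predict function are all assumptions made for illustration; the actual input and submission formats are defined by the CodaLab starting kit and should be followed instead.

```python
# Hypothetical example: the real input/output layout is specified by the
# CodaLab starting kit; adapt field names and file paths accordingly.
import json


def predict(question: str, candidate: str) -> int:
    """Placeholder model: return 1 if the candidate fits the question, else 0."""
    return 1 if "hour" in candidate else 0  # trivial heuristic for illustration


# Assume (hypothetically) a JSON list of {"id", "question", "candidate"} items.
with open("test_unlabeled.json") as f:
    items = json.load(f)

predictions = {
    str(item["id"]): predict(item["question"], item["candidate"])
    for item in items
}

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```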

CITE

Please cite the following paper if you are using this benchmark:

Santos, H., Shen, K., Mulvehill, A. M., Razeghi, Y., McGuinness, D. L., and Kejriwal, M. A Theoretically Grounded Benchmark for Evaluating Machine Commonsense. arXiv preprint arXiv:2203.12184, 2022.

ORGANIZATIONS

USC RPI UCI

ACKNOWLEDGMENT

This effort is part of the MOWGLI project, a project in the DARPA Machine Common Sense (MCS) program, supported by the United States Office of Naval Research under Contract No. N660011924033.
