Theoretically-Grounded Commonsense Reasoning (TG-CSR) is a benchmark intended to systematically evaluate machine commonsense in a few-shot question answering (QA) setting. We refer to the benchmark as theoretically grounded because each question and its candidate answers always fall under a “representational area” in the Gordon-Hobbs theory of commonsense, such as “time”, “space”, and “emotions”. The goal of the benchmark is to assess whether a single machine commonsense model, such as a language representation model, can answer questions across all of these categories given only a small amount of training data.
The task format of TG-CSR is multi-set QA. For each category, a set of candidate answers (roughly 10 per category) is provided and shared among all questions in that category. To make the task more amenable to large-scale models, however, we “pair” each question with a single candidate answer. The task then becomes binary QA: a prediction of “yes” (encoded as 1) indicates that the answer is a good fit for the question, and “no” (encoded as 0) indicates otherwise. The sketch below illustrates this pairing step.
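The following is a minimal sketch of how one multi-set question can be expanded into binary (question, answer, label) pairs. The function name, field layout, and the example question and candidates are all illustrative, not taken from the released data; the actual file format is documented in the starting kit.

```python
from typing import List, Set, Tuple

def to_binary_pairs(
    question: str,
    candidates: List[str],
    good_fits: Set[str],  # candidate answers judged a good fit for the question
) -> List[Tuple[str, str, int]]:
    """Expand one multi-set QA question into binary (question, answer, label) pairs."""
    return [(question, cand, 1 if cand in good_fits else 0) for cand in candidates]

# Illustrative "vacationing abroad"-style example (hypothetical, not from TG-CSR).
pairs = to_binary_pairs(
    "What should the traveler pack for a week-long trip abroad?",
    ["a suitcase", "a passport", "a snowplow"],
    good_fits={"a suitcase", "a passport"},
)
for q, a, label in pairs:
    print(label, "|", q, "->", a)
```

Each question with N shared candidates thus yields N binary instances, which is what lets standard classification-style models be applied directly.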
TG-CSR is a phased benchmark. The initial dataset consists of questions set in a “vacationing abroad” context, for which training, development, and test question sets are provided. More details on format and submission are given in the starting kit. TG-CSR is initially designed as a few-shot problem, but later phases will be zero-shot, with no training or development sets provided. In upcoming phases, we will release three additional question sets with the same format and Gordon-Hobbs representational areas, each designed around a different context.
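Since predictions are binary labels (1 for “yes”, 0 for “no”), one simple way to score a set of predictions against gold labels is shown below. This is only an illustrative sketch; the official metric and submission format are those specified in the starting kit.

```python
from typing import List

def accuracy(gold: List[int], pred: List[int]) -> float:
    """Fraction of (question, answer) pairs labeled correctly."""
    assert len(gold) == len(pred), "gold and predictions must align one-to-one"
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy example: 3 of 4 binary predictions match the gold labels.
print(accuracy([1, 0, 1, 0], [1, 0, 0, 0]))  # 0.75
```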