On this page, you can find more information about the tasks included in the benchmark and how to download the corresponding benchmark data for evaluation.
Task | #Contexts | #Examples | Download |
---|---|---|---|
Dialogue | 1,169 | 3,548 | |
Dialogue Summarization | 453 | 1,805 | |
Intent Detection | 589 | 2,440 | |
Safety Detection | 366 | 2,826 | |
Stance Classification | 397 | 1,722 | |
Machine Translation (en-de) | 500 | 1,000 | |
Machine Translation (en-fr) | 500 | 1,000 | |
Machine Translation (en-ru) | 500 | 1,000 | |
Machine Translation (zh-en) | 600 | 1,200 | |
Overview
Each task example in this benchmark comes with a context and a target. The definitions of context and target depend on the task; see below for details. The same context can be associated with multiple targets, so the number of unique contexts and the total number of examples differ.
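The context-to-target relationship above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual loading code: the records, field names ("context", "target"), and file format are hypothetical and may differ from the released data.

```python
from collections import defaultdict

# Hypothetical records illustrating the context/target structure;
# the real benchmark files may use a different format and field names.
records = [
    {"context": "A: How are you?", "target": "B: I'm fine, thanks."},
    {"context": "A: How are you?", "target": "B: Great, and you?"},
    {"context": "A: Any plans today?", "target": "B: Just reading."},
]

# Group targets by context: one context can map to several targets,
# which is why unique contexts < total examples in the table above.
by_context = defaultdict(list)
for rec in records:
    by_context[rec["context"]].append(rec["target"])

print(len(by_context))  # number of unique contexts
print(len(records))     # total number of examples
```

Counting keys of the grouped dictionary gives the "#Contexts" column, while the raw record count gives "#Examples".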
Open-domain Dialogue
In this task, the goal is to generate a plausible response (target) to a given dialogue history (context). Here is an example:
Data for this task is derived from the conversational dataset CIDER, which is itself constructed from the dialogue datasets DailyDialog, MuTual, and DREAM.
Dialogue Summarization
In this task, the goal is to generate a correct summary (target) of the given dialogue (context). Here is an example:
Data for this task is derived from the dialogue summarization dataset DialogSum, which is itself constructed from the dialogue datasets DailyDialog, MuTual, and DREAM.
Intent Detection
In this task, the goal is to identify the underlying intent (target) of the author of a given text, e.g. a news headline (context). Here is an example:
Data for this task is derived from the Misinformation Reaction Frames (MRF) dataset.
Safety Detection
In this task, the goal is to identify the safe action (target) in a given scenario (context). Here is an example:
Data for this task is derived from the SafeText dataset.
Stance Classification
In this task, the goal is to infer the stance of an argument (target) given a belief (context). Here is an example:
Data for this task is derived from the ExplaGraphs dataset.
Machine Translation
In this task, the goal is to generate a correct translation (target) in one language of a given text (context) in another. Here is an example:
Data for this task is derived from the Chinese-English test suite (for Chinese to English) and the Wino-X multilingual Winograd schema dataset (for English to Russian, French, and German).