GLUE stands for General Language Understanding Evaluation. It is a benchmark for evaluating the performance of models across a diverse set of natural language understanding tasks, and it is designed to encourage the development of models that perform well across many tasks rather than on a single one. The benchmark consists of nine tasks, listed below (a sketch of loading them programmatically follows the list):
CoLA (Corpus of Linguistic Acceptability): A binary classification task where the goal is to predict whether a sentence is grammatically correct or not.
SST-2 (Stanford Sentiment Treebank): A binary classification task where the goal is to predict whether a sentence has a positive or negative sentiment.
MRPC (Microsoft Research Paraphrase Corpus): A binary classification task where the goal is to predict whether two sentences are paraphrases of each other.
QQP (Quora Question Pairs): A binary classification task where the goal is to predict whether two questions are semantically equivalent.
STS-B (Semantic Textual Similarity Benchmark): A regression task where the goal is to predict the similarity between two sentences on a scale from 0 to 5.
MNLI (Multi-Genre Natural Language Inference): A three-way classification task where the goal is to predict whether a premise entails, contradicts, or is neutral with respect to a hypothesis.
QNLI (Question-answering Natural Language Inference): A binary classification task, derived from SQuAD, where the goal is to predict whether a context sentence contains the answer to a given question.
RTE (Recognizing Textual Entailment): A binary classification task where the goal is to predict whether a premise entails a hypothesis.
WNLI (Winograd Natural Language Inference): A binary classification task, derived from the Winograd Schema Challenge, where the goal is to predict whether a sentence with an ambiguous pronoun replaced by a candidate referent is entailed by the original sentence.
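For readers who want to inspect these tasks directly, here is a minimal sketch of loading a few of them with the Hugging Face datasets library. The library and its configuration names ("cola", "stsb", "mnli", and so on) are tooling assumptions, not part of the GLUE benchmark itself.

```python
# Minimal sketch: loading GLUE tasks with the Hugging Face `datasets` library
# (assumes `pip install datasets`). Config names mirror the nine GLUE tasks.
from datasets import load_dataset

# Single-sentence classification: is the sentence grammatically acceptable? (1 = acceptable)
cola = load_dataset("glue", "cola")
print(cola["train"][0])  # {'sentence': ..., 'label': 0 or 1, 'idx': ...}

# Sentence-pair regression: similarity score between 0 and 5.
stsb = load_dataset("glue", "stsb")
print(stsb["train"][0])  # {'sentence1': ..., 'sentence2': ..., 'label': float, 'idx': ...}

# Three-way NLI: entailment (0), neutral (1), contradiction (2).
mnli = load_dataset("glue", "mnli")
print(mnli["validation_matched"][0])
```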
The official website for the GLUE benchmark is gluebenchmark.com. It provides detailed information about the benchmark, including the tasks, the evaluation metrics, and the leaderboard, along with links to the datasets and the evaluation scripts.
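The evaluation metric differs by task: Matthews correlation for CoLA, Pearson/Spearman correlation for STS-B, accuracy and F1 for MRPC and QQP, and accuracy for the remaining tasks. As a hedged sketch, the Hugging Face evaluate library bundles these under a "glue" metric; this is a convenience assumed here for illustration, not the official GLUE evaluation script.

```python
# Sketch of computing GLUE-style metrics with the Hugging Face `evaluate` library
# (assumes `pip install evaluate scikit-learn scipy`). This mirrors, but is not,
# the official GLUE evaluation scripts.
import evaluate

# MRPC is scored with accuracy and F1.
mrpc_metric = evaluate.load("glue", "mrpc")
print(mrpc_metric.compute(predictions=[1, 0, 1], references=[1, 1, 1]))
# -> {'accuracy': ..., 'f1': ...}

# CoLA is scored with Matthews correlation.
cola_metric = evaluate.load("glue", "cola")
print(cola_metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
# -> {'matthews_correlation': ...}
```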