Explore R2ATA
Overview of R2ATA
R2ATA is built from three main reasoning datasets: GSM8K, BBH, and MMLU. All the topics covered in the R2ATA benchmark are shown here.
*In our paper, we explored 0, 1, 2, 4, and 8 adversarial edits. For the R2ATA dataset, we are releasing only the questions that contain 4 adversarial edits.
Edited Words Statistics
Word Clouds of Edited Words
Across all three word clouds, stop words rarely appear among the frequently edited words. This indicates that the edits target content-bearing words, suggesting they aim to disrupt the text's logical flow, coherence, or semantics, and thus strategically degrade the model's reasoning.
Big-Bench Hard (BBH)
Grade School Math 8K (GSM8K)
Massive Multitask Language Understanding (MMLU)
Distribution of Part-Of-Speech (POS) Tags of Edited Words
Nouns, Verbs, and Adjectives constitute the majority of edited words.
Percentage Distribution of Datasets by Error Types
The predominance of whitespace errors highlights a key vulnerability in LLMs: even edits that merely split or join words can degrade reasoning performance.
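To make the whitespace error type concrete, here is a minimal sketch of one way such an edit could be produced; the function name and procedure are illustrative assumptions, not the paper's exact edit mechanism:

```python
import random


def whitespace_edit(word: str, rng: random.Random) -> str:
    """Illustrative whitespace edit: insert a space at a random interior
    position, splitting the word in two (e.g. 'triangle' -> 'tri angle').

    This is a hypothetical sketch of the error type; R2ATA's actual
    adversarial edit procedure may differ.
    """
    if len(word) < 2:
        return word  # nothing to split
    pos = rng.randrange(1, len(word))  # interior split point
    return word[:pos] + " " + word[pos:]
```

A perturbation like this leaves the question visually readable to humans while changing the model's tokenization of the affected word.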
Performance Statistics across 0, 1, 2, 4, and 8 Edits
Edit Distance
From left to right, the data points represent 0, 1, 2, 4, and 8 edits, respectively.
We can see that even a small number of edits leads to a substantial increase in edit distance, resulting in a significant decline in accuracy.
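Edit distance here is presumably character-level Levenshtein distance (the minimum number of insertions, deletions, and substitutions turning one string into the other); a minimal sketch under that assumption:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b, computed with a
    rolling one-row dynamic-programming table."""
    prev = list(range(len(b) + 1))  # distance from a[:0] to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,       # delete ca
                            curr[j - 1] + 1,   # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` is 3, since a single whitespace insertion or character swap each contributes only 1 to the distance.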
Jaccard Coefficient
From left to right, the data points represent 8, 4, 2, 1, and 0 edits, respectively.
Note that the greater the Jaccard Coefficient, the more similar the edited question is to the original question.
Despite the increase in edit distance, the Jaccard coefficient remains relatively stable, consistently exceeding 0.8 across all edit counts.
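The Jaccard coefficient between an original and an edited question can be computed over their sets of tokens; a minimal sketch, assuming whitespace tokenization (the paper's exact tokenization may differ):

```python
def jaccard_coefficient(original: str, edited: str) -> float:
    """Word-level Jaccard similarity |A ∩ B| / |A ∪ B|, where A and B are
    the sets of whitespace-separated tokens of the two questions."""
    a, b = set(original.split()), set(edited.split())
    if not a and not b:
        return 1.0  # two empty strings are identical
    return len(a & b) / len(a | b)
```

A value of 1.0 means the token sets are identical, which is why a coefficient above 0.8 indicates the edited questions stay close to the originals even as character-level edit distance grows.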