Explore R2ATA

Overview of R2ATA

R2ATA comprises three reasoning datasets: GSM8K, BBH, and MMLU. All topics covered by the R2ATA benchmark are shown here.

*In our paper, we explored 0, 1, 2, 4, and 8 adversarial edits. For the R2ATA dataset release, we include only questions containing 4 adversarial edits.

Edited Words Statistics

Word Clouds of Edited Words

Across all three word clouds, stop words rarely appear among the most frequently edited words. Edits instead target content-bearing words, suggesting they aim to disrupt the text's logical flow, coherence, or semantics, and thereby strategically degrade the model's reasoning.
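The filtering step behind such word clouds can be sketched as below. This is a minimal illustration, not the paper's actual pipeline: the stop-word list is a small hand-picked sample, and the function name is hypothetical.

```python
from collections import Counter

# Illustrative stop-word sample; a real pipeline would use a full list
# (e.g. the NLTK stopwords corpus).
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "that", "it"}

def edited_word_frequencies(edited_words):
    """Count how often each content-bearing word was edited,
    dropping stop words before rendering the word cloud."""
    return Counter(
        w.lower() for w in edited_words if w.lower() not in STOP_WORDS
    )
```

Feeding the counter to a word-cloud renderer then scales each word by its edit frequency.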

Big-Bench Hard (BBH)

Grade School Math 8K (GSM8K)

Massive Multitask Language Understanding (MMLU)

Distribution of Part-Of-Speech (POS) Tags of Edited Words

Nouns, Verbs, and Adjectives constitute the majority of edited words.

Percentage Distribution of Datasets by Error Types

The predominance of whitespace errors highlights a key vulnerability in LLMs: even edits that leave every letter intact, merely splitting or joining words, can alter tokenization and degrade reasoning.
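A whitespace error of this kind can be illustrated by splitting a single word with an inserted space. The helper below is a hypothetical sketch for illustration, not the attack procedure used to build R2ATA.

```python
import random

def whitespace_edit(question: str, rng: random.Random) -> str:
    """Insert one space inside a randomly chosen word (length >= 4),
    e.g. 'total' -> 'to tal'. Letters are untouched; only spacing changes."""
    words = question.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    i = rng.choice(candidates)
    w = words[i]
    cut = rng.randint(1, len(w) - 1)  # split point strictly inside the word
    words[i] = w[:cut] + " " + w[cut:]
    return " ".join(words)
```

Removing all spaces from the edited question recovers the original character sequence, which is what makes this error type so cheap yet disruptive.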

Performance Statistics across 0, 1, 2, 4, 8 edits

Edit Distance

From left to right, each data point represents 0, 1, 2, 4, 8 edits respectively.

We can see that even a small number of edits leads to a substantial increase in edit distance, resulting in a significant decline in accuracy.
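Edit distance here is character-level Levenshtein distance between the original and edited question. A minimal dynamic-programming implementation (a standard sketch, not code from the paper):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # delete ca
                curr[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),    # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]
```

A single typo (one substitution or one inserted space) adds 1 to this distance, so even a handful of adversarial edits moves the edited question measurably away from the original.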

Jaccard Coefficient

From left to right, each data point represents 8, 4, 2, 1, 0 edits respectively.

Note that the greater the Jaccard Coefficient, the more similar the edited question is to the original question.

Despite this increase in edit distance, the Jaccard coefficient remains relatively stable, consistently exceeding 0.8 across all edit counts. This is expected: character-level typos alter only a few words, so the word sets of the original and edited questions remain largely overlapping.
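The Jaccard coefficient reported above compares the word sets of the two question versions. A minimal sketch, assuming simple whitespace tokenization and lowercasing (the exact tokenization used in the paper is not specified here):

```python
def jaccard(q1: str, q2: str) -> float:
    """Jaccard coefficient between the word sets of two questions:
    |intersection| / |union|, in [0, 1], with 1 meaning identical sets."""
    a = set(q1.lower().split())
    b = set(q2.lower().split())
    return len(a & b) / len(a | b)
```

Because most words survive a few character-level edits unchanged, the intersection stays large relative to the union, keeping the coefficient above 0.8.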