Three Papers Accepted at EACL 2024
Thrilled to announce that three papers have been accepted at EACL 2024. Here’s a quick look at what each paper explores.
LLMeBench: Making LLM Evaluation Easier
Large language models are being used for an ever-wider range of tasks and languages, but evaluating them across different setups can be surprisingly cumbersome. Our team built LLMeBench, a flexible framework that lets you evaluate LLMs on any NLP task in just a few lines of code. It comes with ready-made dataset loaders, supports multiple model providers (including local models and OpenAI API-compatible hosted ones), and handles most standard evaluation metrics out of the box. Both zero-shot and few-shot learning setups are supported. We put it through its paces across 31 unique NLP tasks, 53 datasets, and roughly 296K data points. The framework is open source and ready for the community to use. You can watch a demo here.
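To make the workflow concrete, here is a minimal sketch of the evaluation-loop pattern a framework like this automates: build a zero- or few-shot prompt, query a model, and score the predictions. This is a hypothetical illustration, not the actual LLMeBench API; `build_prompt`, `evaluate`, and the stub model are invented for the example.

```python
def build_prompt(example, few_shot=()):
    """Assemble a prompt from optional few-shot demonstrations plus the input."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in few_shot]
    lines.append(f"Input: {example}\nLabel:")
    return "\n\n".join(lines)

def evaluate(model, dataset, few_shot=()):
    """Return accuracy of `model` (a prompt -> label callable) on `dataset`."""
    correct = 0
    for text, gold in dataset:
        pred = model(build_prompt(text, few_shot))
        correct += int(pred.strip() == gold)
    return correct / len(dataset)

# Toy run with a stub "model" that labels by keyword (stands in for an LLM call).
dataset = [("great movie", "pos"), ("terrible plot", "neg")]
stub = lambda prompt: "pos" if "great" in prompt else "neg"
accuracy = evaluate(stub, dataset)  # -> 1.0 on this toy data
```

Swapping in a real model provider or a different metric would change only the `model` callable and the scoring line, which is the kind of modularity the framework provides.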
LAraBench: How Do LLMs Compare to Specialized Models for Arabic?
While LLMs like GPT-4 have shown impressive multilingual abilities, there’s been less work rigorously comparing them against models specifically built for Arabic. With LAraBench, we aim to fill this gap. We tested GPT-3.5, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM across 33 Arabic NLP and speech processing tasks spanning 61 datasets — that’s over 296K data points and 46 hours of speech. The key takeaway: specialized Arabic models still generally outperform LLMs in zero-shot settings, but larger LLMs with few-shot techniques are closing the gap significantly. This is valuable intel for anyone deciding whether to use a general-purpose LLM or a domain-specific model for Arabic.
Scaling Up Discovery of Latent Concepts in Deep NLP Models
Deep NLP models are often described as “black boxes” because we can’t easily see what concepts they’ve learned internally. Our previous concept analysis work from ICLR ’22 used clustering to find these hidden concepts, but it was too slow for large datasets and models. In this paper, we explored faster clustering algorithms to scale up the discovery process. We found that K-Means-based approaches dramatically speed things up without sacrificing quality. Even more exciting: we were able to apply this at scale to large language models and discover phrasal concepts, i.e., multi-word expressions that the model has encoded. This opens the door to understanding what LLMs “know” about language at a much deeper level.
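The core idea can be sketched with a tiny pure-Python K-Means: cluster representation vectors so that each cluster approximates a latent concept. This is an illustrative toy, not the paper’s implementation; the 2-D “embeddings” and the deterministic initialization are assumptions made for the example.

```python
def kmeans(points, k, iters=20):
    """Cluster `points` (tuples of floats) into `k` groups by Lloyd's algorithm."""
    # Simple deterministic init: spread starting centroids across the data.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        centroids = [
            tuple(sum(dims) / len(cluster) for dims in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Toy "token embeddings": two well-separated groups of 2-D vectors.
points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
centroids, clusters = kmeans(points, k=2)
```

In practice the points would be high-dimensional contextual representations from a model layer, and each resulting cluster would be inspected to see which words or phrases it groups together.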
Congratulations to all the authors on these accepted papers!