<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>LLM on Fahim Dalvi</title>
    <link>https://fdalvi.github.io/tags/llm/</link>
    <description>Recent content in LLM on Fahim Dalvi</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Sat, 15 Nov 2025 13:00:00 +0300</lastBuildDate>
    <atom:link href="https://fdalvi.github.io/tags/llm/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Three Papers Accepted at EMNLP &amp; ArabicNLP 2025</title>
      <link>https://fdalvi.github.io/blog/2025-11-15-two-papers-accepted-at-emnlp-2025/</link>
      <pubDate>Sat, 15 Nov 2025 13:00:00 +0300</pubDate>
      <guid>https://fdalvi.github.io/blog/2025-11-15-two-papers-accepted-at-emnlp-2025/</guid>
      <description>&lt;p&gt;Thrilled to announce that two of our papers have been accepted at &lt;a href=&#34;https://2025.emnlp.org&#34;&gt;EMNLP 2025&lt;/a&gt;, and one at the co-located &lt;a href=&#34;https://arabicnlp2025.sigarab.org&#34;&gt;ArabicNLP 2025&lt;/a&gt; conference.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;Fine-tuning LLMs is the go-to way to improve them, but most evaluations only tell you &lt;em&gt;that&lt;/em&gt; a model got better, not &lt;em&gt;why&lt;/em&gt;. Our paper takes a different approach, using &lt;strong&gt;model diffing&lt;/strong&gt; to compare the internal representations of two models. We specifically choose SimPO as a case study, and use crosscoders to find and categorize the latent concepts that differentiate the original Gemma-2-9b-it from its fine-tuned version. By looking at the mechanistic changes, we can attribute performance gains to concrete capabilities. This gives us a much richer, more actionable picture of what fine-tuning actually does.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Paper Accepted at COLING 2025</title>
      <link>https://fdalvi.github.io/blog/2025-01-20-aradice-benchmarks-for-arabic-dialects/</link>
      <pubDate>Mon, 20 Jan 2025 13:00:00 +0300</pubDate>
      <guid>https://fdalvi.github.io/blog/2025-01-20-aradice-benchmarks-for-arabic-dialects/</guid>
      <description>&lt;p&gt;Excited to share that our paper &lt;a href=&#34;https://aclanthology.org/2025.coling-main.283/&#34;&gt;AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs&lt;/a&gt; has been accepted at &lt;a href=&#34;https://coling2025.org/&#34;&gt;COLING 2025&lt;/a&gt; in Abu Dhabi.&lt;/p&gt;&#xA;&lt;p&gt;Arabic isn&amp;rsquo;t just one language — it&amp;rsquo;s a family of dialects that vary dramatically from region to region. Yet most LLM evaluations treat Arabic as a monolith, using only Modern Standard Arabic (MSA). This paper addresses that gap by introducing &lt;strong&gt;AraDiCE&lt;/strong&gt;, a benchmark that evaluates LLMs on both dialectal and cultural dimensions. We evaluated several LLMs on these benchmarks and found an interesting pattern: Arabic-specific models like Fanar, Jais, and AceGPT do outperform general multilingual models on dialectal tasks. But significant challenges remain — particularly in dialect identification, generation, and translation.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
