<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Interpretability on Fahim Dalvi</title>
    <link>https://fdalvi.github.io/tags/interpretability/</link>
    <description>Recent content in Interpretability on Fahim Dalvi</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Sat, 15 Nov 2025 13:00:00 +0300</lastBuildDate>
    <atom:link href="https://fdalvi.github.io/tags/interpretability/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Three Papers Accepted at EMNLP &amp; ArabicNLP 2025</title>
      <link>https://fdalvi.github.io/blog/2025-11-15-two-papers-accepted-at-emnlp-2025/</link>
      <pubDate>Sat, 15 Nov 2025 13:00:00 +0300</pubDate>
      <guid>https://fdalvi.github.io/blog/2025-11-15-two-papers-accepted-at-emnlp-2025/</guid>
      <description>&lt;p&gt;Thrilled to announce that two of our papers have been accepted at &lt;a href=&#34;https://2025.emnlp.org&#34;&gt;EMNLP 2025&lt;/a&gt;, and one at the co-located &lt;a href=&#34;https://arabicnlp2025.sigarab.org&#34;&gt;ArabicNLP 2025&lt;/a&gt; conference.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;Fine-tuning LLMs is the go-to way to improve them, but most evaluations only tell you &lt;em&gt;that&lt;/em&gt; a model got better, not &lt;em&gt;why&lt;/em&gt;. Our paper takes a different approach, using &lt;strong&gt;model diffing&lt;/strong&gt; to compare the internal representations of two models. We specifically choose SimPO as a case study, and use crosscoders to find and categorize the latent concepts that differentiate the original Gemma-2-9b-it from its fine-tuned version. By looking at these mechanistic changes, we can attribute performance gains to concrete capabilities. This gives us a much richer, more actionable picture of what fine-tuning actually does.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Paper Accepted at Interspeech 2025!</title>
      <link>https://fdalvi.github.io/blog/2025-08-15-from-words-to-waves-interspeech-2025/</link>
      <pubDate>Fri, 15 Aug 2025 13:00:00 +0300</pubDate>
      <guid>https://fdalvi.github.io/blog/2025-08-15-from-words-to-waves-interspeech-2025/</guid>
      <description>&lt;p&gt;Our paper &lt;a href=&#34;https://doi.org/10.21437/Interspeech.2025-2180&#34;&gt;From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models&lt;/a&gt; has been accepted at &lt;a href=&#34;https://interspeech2025.org/&#34;&gt;Interspeech 2025&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;p&gt;LLMs have shown that text-only training can give models remarkable reasoning abilities and abstract semantic understanding. This raises a fascinating question: &lt;strong&gt;do speech models develop similar conceptual structures when trained only on audio?&lt;/strong&gt; And when models are trained on both speech and text together, do they build a richer understanding?&lt;/p&gt;&#xA;&lt;p&gt;We used &lt;strong&gt;Latent Concept Analysis&lt;/strong&gt; from our prior work on interpretability to examine how semantic abstractions form across modalities, and found a number of interesting differences in how the speech and text modalities organize their internal representations. We released our code and a curated audio version of the SST-2 dataset on &lt;a href=&#34;https://github.com/shammur/MultimodalXplain&#34;&gt;GitHub&lt;/a&gt; and &lt;a href=&#34;https://huggingface.co/collections/QCRI/multimodalxplain&#34;&gt;Hugging Face&lt;/a&gt; to support reproducibility.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Paper Accepted at EMNLP 2024</title>
      <link>https://fdalvi.github.io/blog/2024-11-12-paper-accepted-at-emnlp-2024/</link>
      <pubDate>Tue, 12 Nov 2024 13:00:00 +0300</pubDate>
      <guid>https://fdalvi.github.io/blog/2024-11-12-paper-accepted-at-emnlp-2024/</guid>
      <description>&lt;p&gt;Pleased to share that our paper, &lt;a href=&#34;https://aclanthology.org/2024.emnlp-main.692&#34;&gt;Latent Concept-based Explanation of NLP Models&lt;/a&gt;, has been accepted at EMNLP 2024!&lt;/p&gt;&#xA;&lt;p&gt;This paper continues our line of work on interpretability. We introduce a method called &lt;strong&gt;LACOAT&lt;/strong&gt; (Latent Concept Attribution) that connects predictions with latent concepts present in a model&amp;rsquo;s representation. Hence, we move beyond attributing predictions to individual input tokens towards more holistic, concept-level explanations.&lt;/p&gt;&#xA;&lt;p&gt;The code is available on &lt;a href=&#34;https://github.com/xuemin-yu/eraser_movie_latentConcept&#34;&gt;GitHub&lt;/a&gt;. Congratulations to Xuemin, Nadir, Marzia, and Hassan on this work!&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
