Paper Accepted at ACL 2024

Excited to share that our paper “Exploring Alignment in Shared Cross-lingual Spaces” has been accepted at ACL 2024. The paper aims to build a better understanding of how multilingual models align different languages internally in their representation space. Multilingual language models like mBERT, XLM-R, and mT5 are trained on dozens of languages, but we know surprisingly little about how well their internal representations align across those languages. Do they share a common conceptual space, or does each language occupy its own corner?

Our paper tackles this question by clustering the models’ internal representations to uncover latent concepts, then measuring how those concepts overlap across languages. We introduce two new metrics, CALIGN and COLAP, that quantify cross-lingual alignment of the latent concepts and their overlap across languages.
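To make the cluster-then-measure-overlap idea concrete, here is a minimal toy sketch. Everything in it (the tiny k-means, the `overlap_score` function, the synthetic two-language data) is an illustrative assumption of mine, not the paper’s actual CALIGN or COLAP computation:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Bare-bones k-means, enough for a toy demo (the paper would use a
    # proper clustering setup over real model representations).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def overlap_score(labels, langs, threshold=0.1):
    # Fraction of clusters in which more than one language has a meaningful
    # share of the members -- a rough stand-in for "concept overlap",
    # not the actual COLAP metric.
    k = labels.max() + 1
    multilingual = 0
    for j in range(k):
        member_langs = langs[labels == j]
        if len(member_langs) == 0:
            continue
        shares = [np.mean(member_langs == l) for l in np.unique(member_langs)]
        if sum(s >= threshold for s in shares) > 1:
            multilingual += 1
    return multilingual / k

# Toy data: two "languages" whose vectors are drawn around shared concept
# centroids, so clusters should mix both languages.
rng = np.random.default_rng(1)
concepts = rng.normal(size=(5, 16))
reps, langs = [], []
for lang in ("en", "de"):
    for c in concepts:
        reps.append(c + 0.05 * rng.normal(size=(20, 16)))
        langs += [lang] * 20
X = np.concatenate(reps)
langs = np.array(langs)
labels = kmeans(X, k=5)
score = overlap_score(labels, langs)
print("overlap:", round(score, 2))
```

On this synthetic data the score should be high, since both languages are generated from the same concept centroids; on real model representations the interesting question is where between 0 and 1 it lands.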

Here’s what we found:

  • Deeper layers = more alignment. The deeper you go in the network, the more the model converges on language-agnostic, shared representations.
  • Fine-tuning helps. When you train the model on a specific task, the latent space becomes even more aligned across languages.
  • This helps explain zero-shot transfer. The same task-specific calibration that drives alignment helps explain why multilingual models can transfer knowledge to languages they’ve barely seen during training.
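The first finding, that alignment grows with depth, can be probed in a very simple way: embed translation pairs, and compare the paired vectors layer by layer. The sketch below fakes this with synthetic per-layer vectors whose shared component grows with depth; the cosine-similarity probe and the data are my own illustration, not the paper’s CALIGN measurement:

```python
import numpy as np

def layer_alignment(reps_a, reps_b):
    # Mean cosine similarity between parallel sentence vectors, per layer.
    # reps_a, reps_b: (n_layers, n_sentences, dim) arrays of hidden states
    # for translation pairs (hypothetical input format).
    a = reps_a / np.linalg.norm(reps_a, axis=-1, keepdims=True)
    b = reps_b / np.linalg.norm(reps_b, axis=-1, keepdims=True)
    return (a * b).sum(-1).mean(-1)  # shape: (n_layers,)

# Toy data: at deeper "layers" the two languages share more of their signal,
# mimicking the convergence toward language-agnostic representations.
rng = np.random.default_rng(0)
n_layers, n, d = 4, 50, 32
shared = rng.normal(size=(n_layers, n, d))
sims = []
for l in range(n_layers):
    w = l / (n_layers - 1)  # weight of the shared component grows with depth
    a = w * shared[l] + (1 - w) * rng.normal(size=(n, d))
    b = w * shared[l] + (1 - w) * rng.normal(size=(n, d))
    sims.append(layer_alignment(a[None], b[None])[0])
print([round(s, 2) for s in sims])
```

By construction the similarity rises from near zero at the first layer toward 1.0 at the last, which is the qualitative pattern the paper reports for real multilingual models.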

In practical terms, these findings help us understand why multilingual models work so well at cross-lingual transfer and where they might break down. This matters for anyone building systems that need to work across languages, especially for lower-resource ones. The code is available on GitHub. Congratulations to Basel, Nadir, and Majd on this work!
