Fanar

Pretraining lead on Qatar's very own large language model

Fanar is Qatar’s sovereign GenAI platform, built to preserve the Arabic language and culture. I lead the pretraining team and contribute across other teams as well.

Training at this scale (~200 GPUs, massive datasets, and large models) is super demanding, but in the best way. It’s brought every foundational CS concept from my university days back to life: distributed systems, networking, databases, security, data structures, compilers, and more. It’s also one of the largest projects I’ve worked on in terms of the number (and variety) of people involved, which has made it incredibly fun and rewarding!

Try the platform today on the Web or via our mobile apps (iOS and Android), and sign up for API access to start using Fanar in your own applications.
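To give a flavor of what an API integration might look like, here is a minimal sketch that builds a chat-style request payload. The endpoint URL, model name, and payload shape are assumptions (an OpenAI-style chat-completions interface); check the Fanar API documentation for the real values before sending anything.

```python
import json

# Hypothetical endpoint; the real URL is in the Fanar API docs.
FANAR_API_URL = "https://api.fanar.qa/v1/chat/completions"

def build_chat_payload(prompt, model="Fanar", temperature=0.2):
    """Build an OpenAI-style chat-completions payload for one user prompt.

    The model name "Fanar" is a placeholder, not a confirmed identifier.
    """
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_payload("ما هي عاصمة قطر؟")  # "What is the capital of Qatar?"
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

From here, posting the payload with any HTTP client (plus your API key in the headers) would complete the round trip.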

Also, check out our tech reports and the models:

Fanar 2.0: Arabic Generative AI Stack
Fanar Team
We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.
Cite (.bib)
@misc{fanar2026fanar20arabicgenerative,
    title={Fanar 2.0: Arabic Generative AI Stack},
    author={Fanar Team},
    year={2026},
    eprint={2603.16397},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2603.16397},
}
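The abstract above mentions model merging as one of the levers behind Fanar-27B's gains. As an illustration of the general idea (not Fanar's actual recipe, which is described in the tech report), here is a toy weight-space merge that linearly interpolates parameters across checkpoints:

```python
def merge_state_dicts(dicts, weights):
    """Linearly interpolate parameter values across checkpoints.

    A generic illustration of weight-space model merging; the actual
    merging strategy used for Fanar-27B may differ.
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    keys = dicts[0].keys()
    assert all(d.keys() == keys for d in dicts), "checkpoints must share parameters"
    return {k: sum(w * d[k] for d, w in zip(dicts, weights)) for k in keys}

# Toy "checkpoints" with scalar parameters stand in for real tensors.
a = {"layer.weight": 1.0, "layer.bias": 0.0}
b = {"layer.weight": 3.0, "layer.bias": 2.0}
print(merge_state_dicts([a, b], [0.5, 0.5]))  # {'layer.weight': 2.0, 'layer.bias': 1.0}
```

The same arithmetic extends to real tensors by replacing the scalars with arrays; the appeal is that merging happens entirely in weight space, with no extra training compute.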

Fanar: An Arabic-Centric Multimodal Generative AI Platform
Fanar Team
We present Fanar, a platform for Arabic-centric multimodal generative AI systems that supports language, speech and image generation tasks. At the heart of Fanar are Fanar Star and Fanar Prime, two highly capable Arabic Large Language Models (LLMs) that are best in class on well-established benchmarks for similar-sized models. Fanar Star is a 7B (billion) parameter model that was trained from scratch on nearly 1 trillion clean and deduplicated Arabic, English and Code tokens. Fanar Prime is a 9B parameter model continually trained on the Gemma-2 9B base model on the same 1 trillion token set. Both models are concurrently deployed and designed to address different types of prompts transparently routed through a custom-built orchestrator. The Fanar platform provides many other capabilities including a customized Islamic Retrieval Augmented Generation (RAG) system for handling religious prompts, and a Recency RAG for summarizing information about current or recent events that have occurred after the pre-training data cut-off date. The platform provides additional cognitive capabilities including in-house bilingual speech recognition that supports multiple Arabic dialects, and voice and image generation that is fine-tuned to better reflect regional characteristics. Finally, Fanar provides an attribution service that can be used to verify the authenticity of fact-based generated content.
Cite (.bib)
@misc{fanar,
    title={Fanar: An Arabic-Centric Multimodal Generative AI Platform},
    author={Fanar Team},
    year={2025},
    eprint={2501.13944},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2501.13944},
}
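Both papers describe an orchestrator that routes prompts to specialized components via intent-aware routing. The sketch below shows the shape of such a router; the intent labels, keyword rules, and handler names are illustrative stand-ins (a production system like Fanar's would use a trained classifier and real component backends):

```python
def classify_intent(prompt):
    """Toy keyword-based intent classifier standing in for a learned model."""
    rules = {
        "religion": ("quran", "hadith", "islamic", "fatwa"),
        "recency": ("today", "latest", "news"),
        "translation": ("translate",),
    }
    lowered = prompt.lower()
    for intent, keywords in rules.items():
        if any(kw in lowered for kw in keywords):
            return intent
    return "general"

# Hypothetical handler registry; each entry would wrap a real component
# (Islamic RAG, Recency RAG, a translation model, the core LLM, ...).
HANDLERS = {
    "religion": lambda p: f"[islamic-rag] {p}",
    "recency": lambda p: f"[recency-rag] {p}",
    "translation": lambda p: f"[translation] {p}",
    "general": lambda p: f"[core-llm] {p}",
}

def route(prompt):
    """Dispatch a prompt to the handler matching its classified intent."""
    return HANDLERS[classify_intent(prompt)](prompt)

print(route("What is the latest news from Doha?"))  # handled by the recency path
```

A defense-in-depth design, as the Fanar 2.0 abstract describes, would additionally wrap each handler with safety validation (e.g., a moderation filter like FanarGuard) before and after generation.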