Stella/Jasper Training Code: MTEB Leaders' Secret Sauce

Training custom retrieval models for production RAG typically requires resources most teams don't have. This repository releases the actual training code behind Stella and Jasper, the embedding models topping MTEB leaderboards. The unified pipeline handles dense embedders, multi-vector retrievers, and rerankers with advanced distillation techniques—but independent researchers report extreme hyperparameter sensitivity that forced them to abandon parts of it.

Featured Repository Screenshot

Fine-tuning retrieval components for production RAG hits a wall fast. Dense embeddings need multi-stage distillation. Multi-vector retrievers like ColBERT eat resources. Rerankers demand their own training pipeline. Most teams lack the compute and expertise to build custom retrievers from scratch, so they settle for pre-trained models that don't quite fit their data.

The training code behind Stella and Jasper—embedding models at the top of MTEB leaderboards—went public. This isn't a wrapper around someone else's inference API. It's the pipeline that produced stella_en_1.5B_v5, stella_en_400M_v5, and the multimodal jasper_en_vision_language_v1, complete with the multi-stage distillation techniques and Matryoshka dimension reduction that got them there.

The December 2024 paper "Jasper and Stella: distillation of SOTA embedding models" formalized the approach and released this repository as the training codebase. For teams stuck with generic embeddings, or unable to afford the multi-step training process themselves, this looked like a breakthrough.

What this repository actually contains

The pipeline handles three retrieval components in one framework: dense embedding retrievers, ColBERT-style multi-vector retrievers, and rerankers. The technical approach centers on multi-stage distillation—teaching smaller, faster models to mimic larger teacher models while reducing dimensionality without sacrificing retrieval quality.
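The core idea is straightforward to sketch even if the full pipeline is not: freeze a strong teacher, encode the same batch with both models, and penalize the student for drifting from the teacher's embedding geometry. The snippet below is a minimal PyTorch illustration, not the repository's actual loss; it assumes the student output has already been projected to the teacher's dimensionality, and the loss terms and weights are placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb: torch.Tensor,
                      teacher_emb: torch.Tensor,
                      vec_weight: float = 1.0,
                      sim_weight: float = 1.0) -> torch.Tensor:
    """Illustrative embedding-distillation loss (not the repo's exact recipe).

    Assumes student_emb and teacher_emb share the shape (batch, dim),
    e.g. after a projection layer on the student side.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)

    # Term 1: pull each student vector toward the matching teacher vector.
    vector_loss = F.mse_loss(s, t)

    # Term 2: match the in-batch similarity structure, so relative distances
    # between items are preserved, not just absolute positions.
    sim_loss = F.mse_loss(s @ s.T, t @ t.T)

    return vec_weight * vector_loss + sim_weight * sim_loss
```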

Matryoshka Representation Learning handles the dimension reduction, allowing the same embedding model to work at multiple dimensionalities. The workflow trains models through sequential distillation steps, progressively compressing knowledge from state-of-the-art retrieval systems into more efficient architectures for production deployment.
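In its standard form, Matryoshka training just applies the same retrieval loss to nested prefixes of the embedding, so a vector truncated to 256 or even 64 dimensions remains usable. Below is a generic sketch with an in-batch contrastive objective; the dimension list and temperature are illustrative placeholders, not the repository's configuration.

```python
import torch
import torch.nn.functional as F

def matryoshka_contrastive_loss(query_emb: torch.Tensor,
                                doc_emb: torch.Tensor,
                                dims=(64, 256, 1024),
                                temperature: float = 0.05) -> torch.Tensor:
    """Generic Matryoshka-style objective: the same in-batch contrastive
    loss is computed on nested prefixes of the embeddings, so truncated
    vectors stay useful at query time. Values here are illustrative.
    """
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    total = query_emb.new_zeros(())
    for d in dims:
        q = F.normalize(query_emb[:, :d], dim=-1)
        p = F.normalize(doc_emb[:, :d], dim=-1)
        logits = q @ p.T / temperature   # (batch, batch) similarity matrix
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)
```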

Multiple Hugging Face model cards for NovaSearch's Stella models link directly to this repository as their implementation source. Research papers on positional bias and selective retrieval cite it as the location for code and data.

The hyperparameter sensitivity problem

Then came the reality check. Researchers working on the mxbai-edge paper tried adapting the codebase for their own small retriever training. Their experience: "poorly performing models" with "fluctuating performance" and "extremely high sensitivity to hyperparameters."

They didn't just tweak a few settings and move on. They abandoned the full multi-step process entirely, simplifying to a different loss function because the original pipeline proved too fragile to reproduce reliably. The partial codebase initially released by the Stella authors worked in the original lab environment but didn't transfer to independent research teams with different infrastructure and data.

This isn't an edge case. The complexity that makes the pipeline powerful—multi-stage distillation, dimension reduction, handling multiple retrieval types—also makes it sensitive to configuration choices that aren't fully documented. The gap between "this produced top MTEB scores" and "you can reproduce this" remains significant.

Who should (and shouldn't) attempt this

Teams with dedicated ML infrastructure, existing experience with distillation pipelines, and specific RAG tuning requirements that pre-trained models can't meet have the best shot. If you're already running multi-GPU training jobs and have researchers who understand retrieval architecture tradeoffs, the techniques here are worth studying.

For most practitioners, simpler approaches make more sense. Pre-trained Stella or Jasper models are available on Hugging Face—just use those. If you need customization, fine-tuning a pre-trained embedding model with standard techniques will get you 80% of the benefit without the hyperparameter minefield.
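That "standard techniques" route is well trodden. A minimal sketch with sentence-transformers and in-batch negatives follows; the model id, training pairs, and hyperparameters are placeholders rather than the Stella/Jasper recipe, so check the model card for the exact checkpoint name and loading options.

```python
# Minimal fine-tuning sketch using sentence-transformers with in-batch
# negatives. Model id, data, and hyperparameters are placeholders; pull
# the exact checkpoint name from the Stella model card on Hugging Face.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("NovaSearch/stella_en_400M_v5",  # placeholder id
                            trust_remote_code=True)

# (query, relevant passage) pairs from your own corpus.
train_examples = [
    InputExample(texts=["what is retrieval-augmented generation?",
                        "RAG pairs a retriever with a generator so answers cite indexed documents."]),
    # ...
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Other passages in the batch act as negatives for each query.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="stella-finetuned",
)
```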

The honest assessment: this repository exposes techniques behind models that dominate MTEB rankings. It's research code that happened to produce production-worthy results in the hands of its creators. Whether it works for anyone else depends on resources, expertise, and tolerance for experimental infrastructure. The training methods are real. The reproducibility challenges are equally real.



NovaSearch-Team/RAG-Retrieval

Unify Efficient Fine-tuning of RAG Retrieval, including Embedding, ColBERT, ReRanker.

996 stars · 78 forks

Topics: ai, llm, nlp, rag, retrieval-augmented-generation