Open-R1: Hugging Face's Fully Open DeepSeek-R1 Pipeline

DeepSeek-R1 showed impressive reasoning capabilities, but researchers couldn't reproduce the training pipeline. Hugging Face's open-r1 fills the gaps with synthetic data generation, GRPO training workflows, and end-to-end scripts—turning a closed breakthrough into reproducible infrastructure for labs running their own RL experiments.


DeepSeek-R1 showed what reinforcement learning could do for reasoning models—chain-of-thought responses that worked, strong performance on math and code benchmarks. Then researchers tried to reproduce the training and hit a wall. The weights were public, but the data generation pipeline? Closed. The RL workflow scripts? Undocumented. The end-to-end training recipe? A black box.

Hugging Face's open-r1 project exists to fill those gaps. Built on their 768-H100 Science Cluster, it reconstructs the entire DeepSeek-R1 training stack from scratch—synthetic data generation, multi-stage supervised fine-tuning, GRPO reinforcement learning, evaluation flows. Not just weights you can download, but infrastructure you can run.

The problem: DeepSeek-R1 without the recipe

DeepSeek released R1 and R1-Zero with strong reasoning capabilities and published weights. But for anyone who wanted to train a similar model in-house—or verify the claims—critical pieces were missing. How do you generate the synthetic training data? What does the RL phase look like? How do you chain together supervised fine-tuning on different task types (math, code, reasoning) before the reinforcement learning stage?

Hacker News discussions around the R1 release kept circling back to the same theme: good results, opaque infrastructure. Labs that wanted to experiment with reasoning models on their own hardware couldn't replicate the approach from a paper description. Open-r1 positions itself as "the last missing pieces"—the scripts, data workflows, and training recipes to make R1-style training reproducible outside DeepSeek's internal systems.

What open-r1 provides

The technical stack centers on three components. First, synthetic data generation via Distilabel, Hugging Face's pipeline for creating reasoning-focused training examples. Second, GRPO (Group Relative Policy Optimization) training scripts that handle the reinforcement learning phase where models learn to produce step-by-step reasoning traces. Third, multi-stage workflows that coordinate supervised fine-tuning on math and code datasets before the RL loop kicks in.
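The core idea behind GRPO is easy to state: sample a group of completions per prompt, score each with a (often rule-based, verifiable) reward, and normalize each reward against the group's mean and standard deviation, removing the need for a learned value network. Below is a minimal, illustrative sketch of that advantage computation with a toy format-based reward; the function names and the `<think>` tag convention are assumptions for illustration, not open-r1's actual API.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage for one prompt's group of sampled completions:
    each reward is normalized by the group mean and standard deviation,
    so no separate critic model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std; a sample std also works
    if std == 0:  # every completion scored the same: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def toy_format_reward(completion):
    """Hypothetical rule-based reward: 1.0 if the completion wraps its
    reasoning in <think>...</think> tags, else 0.0. R1-style training
    relies on verifiable rewards like this, alongside answer-correctness
    checks for math and code."""
    has_tags = "<think>" in completion and "</think>" in completion
    return 1.0 if has_tags else 0.0

# One prompt, four sampled completions (a "group"):
completions = [
    "<think>2 + 2 = 4</think> 4",
    "The answer is 4",
    "<think>the sum is 4</think> 4",
    "maybe 5?",
]
rewards = [toy_format_reward(c) for c in completions]
advs = group_relative_advantages(rewards)
print(rewards)  # [1.0, 0.0, 1.0, 0.0]
print(advs)     # [1.0, -1.0, 1.0, -1.0]
```

In the full algorithm these advantages weight a clipped policy-gradient objective with a KL penalty against a reference model; the sketch above only shows the group-normalization step that gives GRPO its name.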

The infrastructure runs on Hugging Face's Science Cluster—768 H100 GPUs—which gives the project enough compute to test training runs at scale. This isn't a finished product. The repo labels itself as a work in progress, with datasets and RL phases still under development. But the goal is clear: give practitioners the tools to run their own reasoning-model experiments instead of treating R1 as a reference implementation they can't touch.

Who's using it

EvolvingLMMs-Lab maintains open-r1-multimodal, a fork that extends the pipeline to multimodal model training. Witness AI's security analysis of "Open R1 vs DeepSeek" treats the project as a reference implementation for evaluating data transparency and misuse risks. GOSIM AI's Paris 2025 conference program includes a talk on "Open-R1: A Fully Open Reproduction of DeepSeek-R1." Jay Alammar's "Illustrated DeepSeek-R1" newsletter points to open-r1 as the default community implementation—evidence that it's becoming the coordination point for researchers working on R1-style systems.

The caveats

Performance is uncertain. Hacker News threads show skepticism that the community reproduction will match DeepSeek's original results—training scale and data quality matter, and both are hard to verify. Security researchers flag concerns about open reasoning models increasing misuse risk, particularly for agentic systems that might execute code. And the work-in-progress status means components are incomplete, documentation is sparse, and training runs might not produce reliable results yet.

What becomes possible with an open reasoning-model stack

When the training pipeline is reproducible, the dynamics shift. Labs can experiment with reasoning models in-house without relying on closed infrastructure. Researchers can verify performance claims by running their own training loops. Security teams can audit capabilities before models reach production. Open-r1 isn't solving the reasoning problem—it's making the problem accessible to anyone with sufficient compute and the will to experiment.


huggingface/open-r1 — Fully open reproduction of DeepSeek-R1 (26.0k stars, 2.4k forks)