I noticed that Hugging Face has already adapted GRPO, the core technique behind DeepSeek-R1, so I decided to give it a try. I chose ERC (Emotion Recognition in Conversations) to see whether a smaller model could be cold-start trained on a single task with reinforcement learning and improve its performance on that task.
First, this technique is very memory-intensive: besides the policy being trained, the trainer keeps a reference model for the KL penalty and has to score a whole group of sampled completions per prompt. I initially tried to train gemma-2-2b and qwen-2.5-3b-instruct on an A100-80G, but ran out of memory. After switching to qwen-2.5-0.5b-instruct, the memory issue was resolved. Second, inference is particularly slow, because the same prompt has to be sampled repeatedly during training.
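To make this concrete, here is a minimal sketch of such a GRPO cold-start run with trl 0.14. It is not my actual training script: the two toy rows, the `emotion_reward` function, and the hyperparameter values are illustrative assumptions; only `GRPOConfig`, `GRPOTrainer`, and the reward-function convention (extra dataset columns are forwarded as keyword arguments) come from TRL itself.

```python
# Illustrative GRPO cold-start sketch for ERC (trl==0.14.0); the data, reward,
# and hyperparameters are placeholder assumptions, not the real experiment.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy ERC-style rows: the prompt shows an utterance, `label` is the gold emotion.
train_dataset = Dataset.from_list([
    {"prompt": "Utterance: 'I can't believe we actually won!'\nEmotion:",
     "label": "joy"},
    {"prompt": "Utterance: 'Why does this always happen to me?'\nEmotion:",
     "label": "anger"},
])

def emotion_reward(completions, label, **kwargs):
    # TRL forwards extra dataset columns (here `label`) to the reward function.
    # Reward 1.0 when the gold emotion word appears in the sampled completion.
    return [1.0 if gold.lower() in completion.lower() else 0.0
            for completion, gold in zip(completions, label)]

training_args = GRPOConfig(
    output_dir="qwen2.5-0.5b-grpo-erc",
    num_generations=8,              # completions sampled per prompt (the slow part)
    max_completion_length=32,
    per_device_train_batch_size=8,  # kept divisible by num_generations
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # the size that fit on one A100-80G
    reward_funcs=emotion_reward,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```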
Fortunately, Hugging Face quickly integrated vLLM for generation, improving efficiency. However, this brought new issues:
- Using vLLM to assist GRPO training requires at least two GPUs, which actually increases the resource demand; it merely shifts the inference load onto a dedicated card (see the configuration sketch after this list).
- A strange error, `_assert_memory_footprint_increased_during_profiling`, kept appearing. After checking the issues in trl, it seems that upgrading vLLM to version 0.7 is necessary to resolve it.
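The vLLM offload is switched on through GRPOConfig. Below is a simplified sketch of the relevant fields, assuming training runs on cuda:0 and generation on cuda:1; the device index, memory fraction, and the script name train_grpo.py are illustrative placeholders, not recommendations.

```python
# Sketch of the vLLM-assisted configuration (trl==0.14.0, vllm==0.7.x).
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen2.5-0.5b-grpo-erc",
    use_vllm=True,                    # offload sampling to a vLLM engine
    vllm_device="cuda:1",             # pin generation to a dedicated card
    vllm_gpu_memory_utilization=0.9,  # fraction of that card vLLM may claim
    num_generations=8,
)
# Train with a single process so cuda:1 stays free for vLLM, e.g.:
#   accelerate launch --num_processes 1 train_grpo.py
```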
The dependency versions that finally worked:
datasets==3.0.1
trl==0.14.0
transformers==4.48.2
peft==0.14.0
accelerate==1.3.0
deepspeed==0.15.3
torch==2.5.1
vllm==0.7.1
Author: 张逸群