Reinforcement Learning: An Overview
We’re introducing HALO 😇
Hierarchal Agent Loop Optimizer
HALO is an RLM-based agent optimization technique that recursively improves agents by analyzing their execution traces and suggesting changes.
This work is inspired by the Mismanaged Genius Hypothesis.
RLM ♥ GEPA: You can use RLMs to improve RLMs with GEPA
The State of Reinforcement Learning for LLM Reasoning
Understanding GRPO and New Insights from Reasoning Model Papers
Reinforcement Learning (RL) Guide | Unsloth Documentation
Learn all about Reinforcement Learning (RL) and how to train your own DeepSeek-R1 reasoning model with Unsloth using GRPO. A complete guide from beginner to advanced.
Advanced: Reinforcement Learning, Kernels, Reasoning, Quantization & Agents AIE 2025
LangGraph Rollout: Evolving VeRL’s Multi-Turn Capabilities for Agent RL
After completing our multi-turn tokenization and masking refactoring, we eliminated a critical bottleneck that was preventing us from building a more consistent and flexible rollout system for our Agent RL research. This breakthrough enabled us to implement a LangGraph-based rollout for VeRL in just a few days, which we’ve already successfully deployed in our Agent RL experiments. In this article, I’ll share our journey from VeRL’s native multi-turn implementation to our new LangGraph-based solution, explaining both the motivations driving this evolution and the technical details of our implementation.