Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies
Abstract
Safe Flow Q-Learning (SafeFQL) extends flow Q-learning (FQL) to safe offline reinforcement learning by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy, achieving lower inference latency and fewer constraint violations.
Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends FQL to safe offline RL by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
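To make the safety-value recursion concrete, here is a minimal sketch of how a reachability-style safety critic could be trained on dataset transitions. The network shapes, the batch keys, and the use of a per-state violation margin h(s) (larger means more unsafe) with the discounted backup V(s) = (1 - γ)·h(s) + γ·max(h(s), V(s')) are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a reachability-style safety critic update.
# Assumptions (not from the paper): batch["margin"] is a precomputed
# constraint-violation margin h(s) >= 0 (larger = more unsafe), and the
# self-consistency target propagates the worst margin reachable from s
# along dataset transitions.
import torch
import torch.nn as nn

class SafetyCritic(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)

def safety_bellman_target(h, v_next, gamma: float = 0.99):
    """Discounted reachability-style self-consistency target:
    V(s) = (1 - gamma) * h(s) + gamma * max(h(s), V(s'))."""
    return (1.0 - gamma) * h + gamma * torch.maximum(h, v_next)

def update_safety_critic(critic, target_critic, batch, optim, gamma=0.99):
    s, s_next, h = batch["obs"], batch["next_obs"], batch["margin"]
    with torch.no_grad():
        target = safety_bellman_target(h, target_critic(s_next), gamma)
    loss = nn.functional.mse_loss(critic(s), target)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```

The sub-level set {s : V(s) ≤ threshold} then plays the role of the learned safe set against which candidate actions are screened.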
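The one-step actor can be read through the lens of FQL-style distillation: a multi-step behavioral-cloning flow policy serves as a teacher, and a single-forward-pass student is trained to match it while maximizing a reward critic, so deployment needs no iterative sampling. The sketch below is a hypothetical rendering of this idea; the network signatures (v_net(s, x, t), mu_net(s, z), q_net(s, a)), the attribute mu_net.action_dim, and the loss weight alpha are assumptions.

```python
# Hypothetical sketch of one-step flow-policy distillation (assumed
# details; not the paper's exact losses).
import torch

@torch.no_grad()
def flow_action(v_net, s, z, steps: int = 10):
    """Generate a teacher action by Euler integration of the learned
    velocity field from noise z at t=0 to an action at t=1."""
    x, dt = z, 1.0 / steps
    for k in range(steps):
        t = torch.full((s.shape[0], 1), k * dt, device=s.device)
        x = x + dt * v_net(s, x, t)
    return x

def distill_actor(mu_net, v_net, q_net, s, optim, alpha: float = 1.0):
    z = torch.randn(s.shape[0], mu_net.action_dim, device=s.device)
    target = flow_action(v_net, s, z)   # multi-step teacher action
    a = mu_net(s, z)                    # one-step student action
    distill = ((a - target) ** 2).sum(-1).mean()
    q_loss = -q_net(s, a).mean()        # reward maximization term
    loss = distill + alpha * q_loss
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```

In a safe variant, the Q-maximization term would additionally be gated or penalized by the safety critic so that the distilled actor only proposes actions inside the (calibrated) safe set.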
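Finally, the conformal calibration step can be sketched with standard split conformal prediction: hold out calibration states, score how much the learned safety value under-estimates the empirically observed worst-case margin, and inflate the decision threshold by the finite-sample (1 - α) quantile. The variable names and the exact nonconformity score below are assumptions for illustration.

```python
# Hypothetical sketch of split-conformal calibration of the safety
# threshold. Assumption (not from the paper): v_true holds empirical
# worst-case margins observed along dataset trajectories from each
# calibration state, so scores measure the critic's under-estimation.
import numpy as np

def calibrate_threshold(v_pred: np.ndarray,
                        v_true: np.ndarray,
                        alpha: float = 0.05) -> float:
    """Return an additive correction q for the safety threshold with
    finite-sample coverage >= 1 - alpha on exchangeable data."""
    scores = v_true - v_pred                # nonconformity scores
    n = len(scores)
    # Finite-sample conformal quantile level: ceil((n+1)(1-alpha))/n.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    return float(q)

# Usage: accept state s as safe only if V_hat(s) + q <= safe_threshold
# (e.g. safe_threshold = 0), which tightens the learned safety boundary
# to account for finite-data approximation error.
```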
Community
The following similar papers were recommended automatically by the Librarian Bot via the Semantic Scholar API:
- Epigraph-Guided Flow Matching for Safe and Performant Offline Reinforcement Learning (2026)
- Conditional Sequence Modeling for Safe Reinforcement Learning (2026)
- Safe Langevin Soft Actor Critic (2026)
- Latent Policy Steering through One-Step Flow Policies (2026)
- Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective (2026)
- GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL (2026)
- How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models? (2026)
