Signal

New arXiv methods refine RL post-training and inference-time control for LLM/VLM agents

Evidence first: scan the strongest sources, then decide whether to go deeper.

Source type: RSS
Tags: models, post_training, reinforcement_learning, alignment, agents, benchmarks
Evidence trail (top sources)
Top sources: 1 domain (domains are deduped; counts indicate coverage, not truth). 1 top source shown; limited source diversity among top sources.
Overview

A new batch of arXiv papers targets practical failure modes in RL-style post-training for LLMs and agentic VLMs—credit assignment in multi-turn tool use, instability from clipping or noisy baselines, and the need for stronger guidance signals under verifiable rewards—alongside an inference-time control approach that improves VLM agent action selection without retraining.
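
The inference-time piece (Best-of-Q, listed under evidence below) is framed as reranking candidate actions with a learned Q-function while leaving the policy frozen. A minimal sketch of that general pattern, assuming a hypothetical interface (`sample_action` and `score` are illustrative names, not the paper's API):

```python
def select_action(policy_vlm, q_model, observation, num_candidates=8):
    """Best-of-N style action selection: sample candidate actions from a
    frozen VLM policy, then return the one a separately trained Q-function
    scores highest. The policy weights are never updated, so the base
    agent needs no retraining."""
    candidates = [policy_vlm.sample_action(observation)
                  for _ in range(num_candidates)]
    scored = [(q_model.score(observation, action), action)
              for action in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```

The appeal is that the Q-function can be trained or swapped independently of the (often much larger) policy, trading extra inference-time sampling for better action selection.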

Entities
TSPO, CFPO, MC-GRPO, Metis-SPECS, Best-of-Q, GRPO, WebVoyager
Score total: 1.41
Momentum 24h: 6
Posts: 6
Origins: 1
Source types: 1
Duplicate ratio: 0%
Why now
  • Multiple related RL optimization papers landed on arXiv in the same release window
  • Verifiable-reward RL and tool-integrated multi-turn reasoning remain active research areas
  • Inference-time control is highlighted as a way to adapt agents without retraining
Why it matters
  • Targets RL post-training pain points: sparse rewards, instability, and weak credit assignment
  • Several proposals aim to improve performance without large compute increases (e.g., small-rollout stability; inference-time reranking)
  • Agentic VLM control is framed as improvable via better action selection and reward shaping
LLM analysis
Topic mix: low · Promo risk: low · Source quality: medium
Recurring claims
  • Sparse, outcome-level rewards in multi-turn search/tool reasoning can ignore process signals and reduce within-group reward variance, motivating turn-level, stage-aware reward allocation.
  • Clipping-based RL objectives can create optimization pathologies (e.g., zero-gradient regions and instability), motivating a clipping-free alternative with a differentiable penalty (see the first sketch after this list).
  • With small rollout budgets, mean-based shared baselines in group-relative methods can induce advantage sign flips; a median baseline is proposed to mitigate this (see the second sketch after this list).
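
The clipping claim can be made concrete: in PPO-style objectives, once the probability ratio is clipped, the gradient through the clipped branch is zero, and a clipping-free variant replaces the hard clip with a smooth penalty. A generic sketch of that contrast in PyTorch (this illustrates the pattern, not necessarily CFPO's exact objective):

```python
import torch

def ppo_clip_loss(ratio, adv, eps=0.2):
    # Hard clip: when the ratio leaves [1-eps, 1+eps] on the wrong side,
    # min() selects the constant clipped term and the policy gradient
    # vanishes -- the "zero-gradient region" the claim refers to.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * adv, clipped * adv).mean()

def clip_free_loss(ratio, adv, lam=1.0):
    # Clipping-free alternative: keep the plain policy-gradient term and
    # discourage large ratio deviations with a smooth, everywhere-
    # differentiable quadratic penalty, so gradients never vanish outright.
    return -(ratio * adv - lam * (ratio - 1.0) ** 2).mean()
```

Similarly for the baseline claim: with only a handful of rollouts per group, one outlier reward can drag the mean far enough to flip the sign of the other samples' advantages, while a median baseline is robust to the outlier. A self-contained toy example (not any paper's exact estimator):

```python
from statistics import mean, median

def group_advantages(rewards, baseline="median"):
    # Group-relative advantage: each rollout's reward minus a shared
    # baseline computed over the same small group of rollouts.
    b = median(rewards) if baseline == "median" else mean(rewards)
    return [r - b for r in rewards]

# Four rollouts, one with a noisy outlier reward:
rewards = [0.9, 0.8, 0.7, 10.0]
print(group_advantages(rewards, "mean"))    # ~ [-2.2, -2.3, -2.4, 6.9]
print(group_advantages(rewards, "median"))  # ~ [0.05, -0.05, -0.15, 9.15]
# Under the mean baseline the three ordinary rollouts all flip to
# negative advantage; under the median they stay near zero.
```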
How sources frame it
  • TSPO Authors: supportive
  • CFPO Authors: supportive
  • Best-of-Q Authors: supportive
This cluster is a coordinated arXiv drop on RL-style post-training and inference-time control for LLM/VLM agents; keep claims tightly tied to the abstracts.
All evidence
Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference
arXiv cs.LG and cs.AI RSS · arxiv.org · 2026-02-02 05:00 UTC
Posts loaded: 0 · Publishers: 1 · Origin domains: 1 · Duplicates: -
Top publishers (this list)
  • arXiv cs.LG and cs.AI RSS (1)
Top origin domains (this list)
  • arxiv.org (1)