Policy-Guided Path Selection and Evaluation in Multi-Step Reasoning with Large Language Models

Abstract
This paper addresses core challenges in Chain-of-Thought reasoning with large language models, including path instability, structural redundancy, and the lack of explicit strategy control. It proposes a reasoning optimization framework that integrates multi-path evaluation with policy-based scheduling. The framework consists of two components: a Multi-Path Adaptive Evaluation (MPAE) module and a Policy-Aware Reasoning Scheduler (PARS), which together improve Chain-of-Thought performance from the complementary perspectives of structural quality modeling and behavioral decision control. MPAE encodes candidate reasoning paths into vector representations, assigns each path a semantic score through a learnable path quality function, and uses these scores to guide path aggregation and answer generation. PARS formulates path selection as a reinforcement learning problem, training a policy network that dynamically adjusts scheduling behavior according to reward signals and thereby improves the stability and consistency of reasoning outputs. Experiments on the GSM8K mathematical reasoning benchmark evaluate the framework on accuracy, consistency, and robustness. Compared with existing Chain-of-Thought methods, the proposed framework shows clear advantages in structural selection and strategy adaptability. Ablation studies quantify the individual contributions of MPAE and PARS, and additional experiments on path distribution and robustness confirm that the framework maintains stable reasoning under high uncertainty. The resulting approach offers a clear structure, controllable strategy, and adaptive path selection, and effectively improves Chain-of-Thought reasoning quality on complex tasks.
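
To make the two components concrete, the following is a minimal, hypothetical sketch of the mechanism the abstract describes: a learnable path quality scorer in the spirit of MPAE and a policy that selects among scored paths using a reward signal in the spirit of PARS. All class names, dimensions, and the softmax-policy and REINFORCE-style update shown here are illustrative assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn


class PathQualityScorer(nn.Module):
    """MPAE-style scorer (assumed form): maps a path embedding to a scalar quality score."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, path_embeddings: torch.Tensor) -> torch.Tensor:
        # path_embeddings: (num_paths, dim) -> (num_paths,) quality scores
        return self.mlp(path_embeddings).squeeze(-1)


class PathSelectionPolicy(nn.Module):
    """PARS-style policy (assumed form): turns path scores into a selection distribution."""

    def __init__(self, temperature: float = 1.0):
        super().__init__()
        self.log_temp = nn.Parameter(torch.tensor(float(temperature)).log())

    def forward(self, scores: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=scores / self.log_temp.exp())


def reinforce_step(scorer, policy, optimizer, path_embeddings, rewards):
    """One REINFORCE-style update: paths that earn high reward
    (e.g. a correct final answer) become more likely to be selected."""
    scores = scorer(path_embeddings)
    dist = policy(scores)
    action = dist.sample()                        # index of the chosen path
    loss = -dist.log_prob(action) * rewards[action]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action.item()


if __name__ == "__main__":
    scorer, policy = PathQualityScorer(), PathSelectionPolicy()
    optimizer = torch.optim.Adam(
        list(scorer.parameters()) + list(policy.parameters()), lr=1e-4
    )
    paths = torch.randn(4, 768)                   # stand-in path embeddings
    rewards = torch.tensor([0.0, 1.0, 0.0, 0.0])  # e.g. answer correctness per path
    chosen = reinforce_step(scorer, policy, optimizer, paths, rewards)
    print("selected path:", chosen)

In this sketch the scorer plays the role of the path quality function and the policy plays the role of the scheduler; in the paper's framework the reward signal and the aggregation over scored paths are more elaborate than the single-step update shown here.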