We introduce a stepwise self-evaluation mechanism tailored to enhance the reasoning capabilities of Large Language Models (LLMs), and propose an effective prompting framework that integrates self-evaluation guidance via stochastic beam search. The self-evaluation guidance serves as a better-calibrated automatic criterion, facilitating an efficient search in the reasoning space and resulting in superior prediction quality. With temperature-controlled randomness, stochastic beam search balances exploitation and exploration of the reasoning search space. This allows our approach to excel at producing high-quality single-chain generations and to adapt well to the multiple-chain scenario with high diversity. Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by 6.34%, 9.56%, and 5.46% on the GSM8K, AQuA, and StrategyQA benchmarks, respectively. Further analysis of multi-step reasoning shows that our self-evaluation guidance pinpoints logical failures and leads to higher consistency and robustness.
Prior work on LLM reasoning demonstrates significant improvements in model performance from breaking a problem down into intermediate stages, i.e., a reasoning chain (e.g., chain-of-thought (CoT) and program-aided language models (PAL)). However, as the complexity and length of reasoning chains increase, LLMs struggle with errors and imperfections that accumulate across intermediate steps. Furthermore, the growing number of steps leads to exponential growth of the reasoning search space, making it exceedingly difficult to obtain accurate final outcomes.
In our work, we introduce a stepwise self-evaluation mechanism tailored to facilitate an efficient search in the reasoning space. We adopt LLM self-evaluation as an automatic criterion to guide the reasoning process, drawing inspiration from prior work on utilizing LLMs for self-evaluation (Kadavath et al., 2022). Specifically, we formulate reasoning chain generation as a decoding process consisting of multiple intermediate steps. We employ stochastic beam search decoding, using temperature-controlled randomness to balance exploitation and exploration of the reasoning search space.
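To make this concrete, the sketch below illustrates one way such a decoding loop could look. The callables `propose_steps` and `evaluate_step` are hypothetical placeholders for the LLM generation and self-evaluation prompts, and the scoring and sampling details are simplifications for illustration rather than the exact formulation used in our framework.

```python
import math
import random


def stochastic_beam_search(question, propose_steps, evaluate_step,
                           beam_size=4, max_steps=8, temperature=0.5):
    """Decode a reasoning chain step by step with self-evaluation guidance.

    `propose_steps(question, prefix)` should return a list of
    (next_step_text, log_prob) pairs from the LLM generator, and
    `evaluate_step(question, prefix, step)` should return a self-evaluation
    confidence in (0, 1] for that step. Both are placeholders for LLM calls.
    """
    beams = [([], 0.0)]  # (partial reasoning chain, accumulated log-score)
    for _ in range(max_steps):
        candidates = []
        for steps, log_score in beams:
            for step, log_prob in propose_steps(question, steps):
                conf = evaluate_step(question, steps, step)
                # Combine generation probability and evaluation confidence
                # into the accumulated score of the extended chain.
                candidates.append(
                    (steps + [step], log_score + log_prob + math.log(conf))
                )
        if not candidates:
            break
        # Temperature-controlled sampling over candidate chains (with
        # replacement, for brevity): a low temperature approaches greedy
        # beam search (exploitation); a high one keeps more diverse chains
        # in the beam (exploration).
        best = max(score for _, score in candidates)
        weights = [math.exp((score - best) / temperature)
                   for _, score in candidates]
        beams = random.choices(candidates, weights=weights, k=beam_size)
    # Return the reasoning chain with the highest accumulated score.
    return max(beams, key=lambda b: b[1])[0]
```

In this simplified view, lowering the temperature yields near-deterministic decoding of a single high-quality chain, while raising it produces the diverse chains used in the multiple-chain scenario.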
We compare the predictions of the baseline and our method on particular instances. Scores from low to high are visualized from orange through yellow to green. Here, \(\mathcal{C}\), \(\mathcal{P}\), and \(\mathcal{E}\) denote the evaluation confidence, the generation confidence (probability), and their combination as the final self-evaluation score, respectively.
In general, the evaluation confidence \(\mathcal{C}\) is more effective at identifying logical errors because it takes into account mistakes accumulated from prior steps, whereas the generation probability \(\mathcal{P}\) focuses more on text perplexity as the confidence of the LLM generator.
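As an illustration of how the two signals might be blended into \(\mathcal{E}\), the snippet below combines them as a weighted interpolation in log space. The weighting parameter `lam` and this specific combination rule are assumptions made for the sketch, not the exact formula from the paper.

```python
import math


def combined_score(gen_log_prob, eval_conf, lam=0.5):
    """Blend the generation log-probability (P) with the self-evaluation
    confidence (C) into a single per-step score (E).

    `lam` is an assumed weighting hyperparameter: lam=1 trusts only the
    generator, lam=0 trusts only the evaluator.
    """
    return lam * gen_log_prob + (1.0 - lam) * math.log(max(eval_conf, 1e-9))


# A fluent step (high P) that the evaluator flags as logically dubious
# (low C) ends up with a lower combined score than a slightly less fluent
# but logically sound step.
print(combined_score(gen_log_prob=-0.2, eval_conf=0.15))  # ~ -1.05
print(combined_score(gen_log_prob=-0.8, eval_conf=0.90))  # ~ -0.45
```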
You may also refer to related works such as CoT, PAL, Tree-of-Thought, and LLM-Reasoners, which serve as foundations of and variations on our Guided-Decoding framework and code repository.
@misc{xie2023decomposition,
      title={Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding},
      author={Yuxi Xie and Kenji Kawaguchi and Yiran Zhao and Xu Zhao and Min-Yen Kan and Junxian He and Qizhe Xie},
      year={2023},
      eprint={2305.00633},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}