Understanding Concentrability in Direct Nash Optimization

Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];
(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].
Table of Links
Abstract and 1 Introduction
2 Preliminaries
2.1 RLHF Based on Reward Models
2.2 RLHF with General Preferences
3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1
3.2 Theoretical Analysis
4 Practical Algorithm – Iterative Contrastive Self-Improvement
5 Experiments and 5.1 Experimental Setup
5.2 Results and Analysis
6 Related Work
7 Conclusion and References
Appendix
A Extension to Regularized Preferences
B Detailed Proofs
C Additional Experimental Details
B Detailed Proofs
In this section, we provide detailed proofs of our theoretical results. Note that the definitions and assumptions introduced below draw heavily on ideas related to version spaces and concentrability from the reinforcement learning theory literature (e.g., Xie et al., 2021, 2023). However, the descriptions here are intended to convey the key insights behind the algorithm design; a full and comprehensive theoretical analysis is beyond the scope of this paper. We now present the following definitions and assumptions.
Definition 2 can be viewed as a natural extension of the concentrability coefficient from the (offline) reinforcement learning literature to our setting.
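As an illustration only (the notation below is ours and need not match Definition 2 exactly), a concentrability coefficient in offline reinforcement learning is typically a worst-case ratio of the form

\[
C_{\pi} \;:=\; \sup_{f \in \mathcal{F}} \frac{\mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x)}\big[ f(x, y)^2 \big]}{\mathbb{E}_{x \sim \rho,\; y \sim \mu(\cdot \mid x)}\big[ f(x, y)^2 \big]},
\]

which is finite only when the data-collection distribution \(\mu\) adequately covers the responses that the comparator policy \(\pi\) would generate; Definition 2 adapts this coverage idea to the preference-based setting.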
Proof of Theorem 2. We now provide the proof via the following two-step procedure.
Step 1: From regression with log loss to a squared-error bound. By standard results on regression with the logarithmic loss, we know
Note that similar results also apply in the bounded case; for simplicity, we omit the detailed discussion from our paper. For a more in-depth discussion of regression with the logarithmic loss, the reader may refer to, e.g., Foster and Krishnamurthy (2021).
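To make the shape of such a bound concrete, here is a generic guarantee of this kind, stated in illustrative notation of our own rather than the paper's exact statement: for a finite hypothesis class \(\mathcal{H}\), with probability at least \(1 - \delta\), the log-loss (maximum-likelihood) fit \(\widehat{\mathcal{P}}\) over \(n\) samples drawn from \(\mu\) satisfies

\[
\mathbb{E}_{(x, y, y') \sim \mu}\Big[ \big( \widehat{\mathcal{P}}(y \succ y' \mid x) - \mathcal{P}(y \succ y' \mid x) \big)^2 \Big]
\;\lesssim\; \frac{\log\!\big( |\mathcal{H}| / \delta \big)}{n},
\]

i.e., minimizing the logarithmic loss in-sample yields an out-of-sample squared-error bound under the sampling distribution, which is exactly the type of conversion used in Step 1.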
On the other hand, we have