
The Art of Arguing with Yourself – and Why It Makes Artificial Intelligence Smarter

Authors:

(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];

(2) Ching-An Cheng, Microsoft Research;

(3) Arindam Mitra, Microsoft Research;

(4) Michael Santacroce, Microsoft Research;

(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];

(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].

Abstract and 1 Introduction

2 Preliminaries

2.1 RLHF Based on Reward Models

2.2 RLHF with General Preferences

3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1

3.2 Theoretical Analysis

4 Practical Algorithm – Iterative Contrastive Self-Improvement

5 Experiments and 5.1 Experimental Setup

5.2 Results and Analysis

6 Related Work

7 Conclusion and References

Appendix

A Extension to Regularized Preferences

B Detailed Proofs

C Additional Experimental Details

Abstract

This paper studies post-training of large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning from subsequent policy optimization. However, such a reward-maximization approach is limited by the nature of “point-wise” rewards (such as the Bradley-Terry model), which fail to express complex intransitive or cyclic preference relations. While advances in RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward-maximization framework. Recently, a new wave of research sidesteps the reward-maximization presumption in favor of directly optimizing over “pair-wise” or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations, which helps it improve even over a strong teacher (such as GPT-4). In our experiments, the resulting 7B parameter Orca-2.5 model aligned by DNO achieves a state-of-the-art win rate of 33% against GPT-4-Turbo on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (7% → 33%) over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4. Our ablation studies analyze critical design decisions surrounding the choice of preference pairs and the use of LLMs as preference annotators. These results underscore the promise of DNO for post-training LLMs and offer actionable insights for the AI research community.

1 Introduction

The field of artificial intelligence is evolving towards advanced models that can understand, reason, follow complex instructions, and create nuanced content, all while aligning with human values and preferences. Large language models (LLMs) (e.g., Brown et al., 2020; Ouyang et al., 2022) have demonstrated remarkable capabilities, yet they still face challenges in tasks that demand high standards of reliability, safety, and ethical alignment. To address these challenges, fine-tuning LLMs with Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Bai et al., 2022; Ouyang et al., 2022) has shown strong potential for aligning them with human intent.

The RLHF framework has long been studied in the context of preference-based reinforcement learning (RL) or RL from human preferences (e.g., Knox and Stone, 2008; Akrour et al., 2012; Christiano et al., 2017). Conventional RLHF methods usually assume that the preference is determined by a scalar reward function through some model, such as the Bradley-Terry (BT) model (Bradley and Terry, 1952).[1] RLHF then optimizes toward the preference in a two-step procedure: reward learning, followed by policy optimization (through RL) to maximize the learned reward. Under certain conditions, this two-step procedure can be streamlined into a single-step contrastive learning approach (Rafailov et al., 2023), which eliminates the need for explicit reward learning. Algorithms of this type (e.g., Rafailov et al., 2023) are inherently offline and enjoy greater stability and ease of optimization. However, both the two-step RLHF algorithms and their single-step contrastive variants rest primarily on the reward-maximization framework, in which preferences are governed by reward-based models such as the BT model.
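For concreteness, the following minimal Python sketch (not code from the paper; the numbers and helper names are illustrative) shows the Bradley-Terry preference probability and a DPO-style contrastive loss, in which the reward is implicit in the policy's log-probability ratios against a reference model, so no separate reward model is trained.

import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bt_preference_prob(reward_y: float, reward_y_prime: float) -> float:
    """Bradley-Terry: P(y > y' | x) = sigmoid(r(x, y) - r(x, y'))."""
    return sigmoid(reward_y - reward_y_prime)

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Single-step contrastive objective: the implicit reward is
    beta * log(pi(y|x) / pi_ref(y|x)), so no explicit reward model is needed."""
    implicit_margin = beta * ((logp_chosen - ref_logp_chosen)
                              - (logp_rejected - ref_logp_rejected))
    return -math.log(sigmoid(implicit_margin))

# Toy usage: the chosen response gained probability relative to the reference,
# while the rejected response lost probability.
print(bt_preference_prob(1.2, 0.4))           # about 0.69
print(dpo_loss(-10.0, -12.0, -10.5, -11.5))   # loss for one preference pair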

Figure 1: Direct Nash Optimization (DNO) achieves state-of-the-art results for a 7B parameter large language model, being the first to exceed 30% in both raw win rate and length-controlled (LC) win rate against GPT-4-Turbo. Win rate and LC win rate have a 0.93 to 0.98 correlation with Chatbot Arena scores.

However, the reward-maximization framing is fundamentally limited. Reward functions, defined to output a scalar score r(x, y) for a single response y to input x, cannot express general preferences y ≻ y′ | x between a pair of outputs in all cases, for example intransitive or cyclic preferences (Elo, 1978). Hence, LLMs trained under reward maximization cannot always align with human preference. Moreover, recent works show that even in settings where preferences can be perfectly expressed under the framework of BT-based reward models, optimizing towards rewards yields problematic behaviors; we refer the reader to Bertrand et al. (2023), Azar et al. (2023), and Munos et al. (2023) for further details. Finally, reward functions in practice can quickly become “stale” as the distribution of the policy shifts under training (Ross et al., 2011; Cheng et al., 2023).
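As a toy illustration of this limitation (a hypothetical example, not taken from the paper), the short Python check below shows that a cyclic preference A ≻ B, B ≻ C, C ≻ A cannot be reproduced by any scalar reward under the Bradley-Terry model, since giving every edge a probability above 1/2 would require r(A) > r(B) > r(C) > r(A).

import itertools, math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# "Winner beats loser" along a cycle: A > B, B > C, C > A.
cycle = [("A", "B"), ("B", "C"), ("C", "A")]

def fits_cycle(rewards: dict) -> bool:
    """True if Bradley-Terry with these rewards gives every edge probability > 0.5."""
    return all(sigmoid(rewards[w] - rewards[l]) > 0.5 for w, l in cycle)

# Brute-force a grid of candidate reward assignments: none reproduces the cycle.
grid = [v / 2.0 for v in range(-6, 7)]
assert not any(fits_cycle(dict(zip("ABC", rs)))
               for rs in itertools.product(grid, repeat=3))
print("No scalar reward on the grid can express the cyclic preference.")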

We are motivated to overcome two separate challenges: the limited expressivity of reward-based RLHF, and the lack of clarity about how to scale up optimization with respect to general preferences. Recent advances in reward-based optimization, e.g., DPO, already have efficient and scalable implementations; we seek an equally efficient solution within the framework of general preferences.

We propose a provable and scalable RLHF algorithm, Direct Nash Optimization (DNO) (Algorithm 1), that achieves the best of both worlds, combining the scalability of contrastive objectives with the theoretical soundness of general preference optimization. DNO is designed as a batched on-policy algorithm with a regression-based learning objective; this design choice makes DNO stable and scalable, striking a balance between deployment efficiency and adaptability.

At a high level, we summarize the key ingredients and insights of DNO below.

  1. To address the issue that reward functions cannot express general preferences, we leverage recent insights that the notion of reward should be expressed as the expected win rate with respect to the general preference function.[2]

  2. To address the problem in prior work that optimizing this more general objective with online algorithms is sample-inefficient or unstable, we decompose the learning procedure into a sequence of “batched on-policy” iterations, wherein each step instead optimizes a simple regression objective.

  3. The regression objective (we choose binary cross-entropy) aligns the “internal reward function” of the policy with the expected win rate against itself (as defined in Line 3 of Algorithm 1). By sampling outputs from the current policy for use in training (i.e., “self-play”), this procedure incentivizes self-improving behavior (a minimal code sketch of this step follows the list).

  4. Our framework is general enough to admit off-policy samples into training and, importantly, samples from a more powerful teacher (see the choice of µ1 and µ2 in Algorithm 1).

  5. Furthermore, to ensure stability and computational efficiency, we propose a filtering scheme such that the reward regression is performed only on preference pairs with a sufficiently large margin (for the theoretical explanation, see Section 4; in practice, see Section 5.2).

  6. DNO repeats this procedure for multiple iterations to let the policy optimize towards the general preference. Since each step involves a regression problem, it can be implemented easily at scale.
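To make items 1-3 concrete, here is a minimal Python sketch of the win-rate-as-reward computation and a binary cross-entropy pair loss on the policy's internal reward. This is an illustration under assumed interfaces, not the paper's Algorithm 1; policy_logprob, current_logprob, and preference_oracle are hypothetical stand-ins for the policy being trained, the current-iterate policy, and the general preference function.

import math
from typing import Callable, List

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def expected_win_rate(x: str, y: str, self_play_samples: List[str],
                      preference_oracle: Callable[[str, str, str], float]) -> float:
    """Item 1: the 'reward' of y is its expected win rate, under the general
    preference function, against outputs sampled from the current policy."""
    return sum(preference_oracle(x, y, y2) for y2 in self_play_samples) / len(self_play_samples)

def bce_pair_loss(x: str, y_win: str, y_lose: str,
                  policy_logprob: Callable[[str, str], float],
                  current_logprob: Callable[[str, str], float],
                  beta: float = 0.1) -> float:
    """Item 3: a binary cross-entropy regression on the policy's internal reward
    beta * log(pi(y|x) / pi_t(y|x)), pushing it to rank y_win above y_lose."""
    margin = beta * ((policy_logprob(x, y_win) - current_logprob(x, y_win))
                     - (policy_logprob(x, y_lose) - current_logprob(x, y_lose)))
    return -math.log(sigmoid(margin))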

Theoretically, we prove that DNO converges to the intended Nash equilibrium on average, and that it can improve monotonically across iterations (see Section 3.1). Furthermore, our finite-sample analysis shows that the approximation error at any iteration between the learned policy and the target is tightly bounded (Theorem 1).
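For reference, the solution concept targeted here can be written as follows. This is a hedged sketch in the standard notation of the general-preference RLHF literature (e.g., Munos et al., 2023), with x, y, π, and the preference function P as used informally above; it is not a verbatim equation from the paper.

\[
\pi^{\star} \;=\; \arg\max_{\pi}\,\min_{\pi'}\;
\mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\big[\mathcal{P}(y \succ y' \mid x)\big],
\]

so that the Nash equilibrium policy π* is preferred over any competing policy at least half of the time.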

On the practical side, we provide a scalable implementation of DNO (Algorithm 2): an iterative self-improving algorithm with contrastive updates, which approximates Algorithm 1 under several critical design choices. Those choices include: sampling multiple online outputs from the policy being trained, using GPT-4 as the preference annotator, comparing on-policy samples against GPT-4's own (teacher) outputs, and training only on pairs with a “large margin” (for the theoretical explanation, see Section 4; in practice, see Section 5.2).
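The following minimal Python sketch outlines this iterative loop (illustrative only; Sampler, teacher_output, annotate_scores, and train_contrastive are hypothetical stand-ins for the student policy, the teacher, the preference annotator such as GPT-4, and a DPO-style trainer).

from typing import Callable, Dict, List, Tuple

Sampler = Callable[[str, int], List[str]]   # (prompt, n) -> n sampled responses

def self_improve(prompts: List[str],
                 student: Sampler,
                 teacher_output: Callable[[str], str],
                 annotate_scores: Callable[[str, List[str]], Dict[str, float]],
                 train_contrastive: Callable[[Sampler, List[Tuple[str, str, str]]], Sampler],
                 iterations: int = 3, n_samples: int = 5, margin: float = 2.0) -> Sampler:
    for _ in range(iterations):
        pairs: List[Tuple[str, str, str]] = []
        for x in prompts:
            # Multiple on-policy outputs from the student, plus the teacher's output.
            candidates = student(x, n_samples) + [teacher_output(x)]
            scores = annotate_scores(x, candidates)   # annotator ratings, e.g. from GPT-4
            pairs += [(x, y_pos, y_neg)
                      for y_pos in candidates for y_neg in candidates
                      if scores[y_pos] - scores[y_neg] >= margin]  # large-margin filter
        # Contrastive update on the filtered pairs; the updated student becomes
        # the sampling policy for the next iteration.
        student = train_contrastive(student, pairs)
    return student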

The primary distinction between our work and the related works Nash-MD (Munos et al., 2023) and SPO (Swamy et al., 2024) is that we resolve their sample-inefficiency and instability with a sample-efficient objective that works in practice, and that DNO is more flexible in incorporating off-policy samples, e.g., from a powerful teacher.

Most importantly, DNO works in practice: we provide comprehensive empirical evaluations, resulting in state-of-the-art performance:

• The resulting Orca-2.5 model (7B parameters), aligned using the practical implementation of DNO (Algorithm 2), achieves a state-of-the-art win rate for any 7B model, exceeding 33% against GPT-4-Turbo on AlpacaEval 2.0, even after controlling for length. This is an absolute gain of over 26% (7% → 33%) compared to the initialized model. It outperforms several recent advanced closed-source models, including Mistral Large and GPT-4-0613, as well as open-source models with far more (10×) parameters, such as Self-Rewarding LM (Yuan et al., 2024), which has 70B parameters.

• Our thorough ablation studies in Section 5.2 examine critical design choices surrounding the choice of loss function (supervised fine-tuning or contrastive), the training paradigm (with or without on-policy samples), the quality of preference annotations (large margin or not), and the construction of training pairs (self-play, teacher-vs-student, etc.). Our findings highlight that the carefully designed methods codified in Algorithm 2 lead to substantial gains.

• We show sample outputs across iterations that demonstrate qualitative improvements, such as better addressing nuanced issues and presumptuous questions (Table 5), better organization and clarity while refraining from making misleading statements (Table 6), and higher information density in answers (Table 7).

We hope that the results presented here provide clarity to the community regarding the use of AI feedback for post-training LLMs.


[1] We use “reward model” to refer to a framework that translates preferences into rewards, e.g., Bradley-Terry, whereas a “reward function” is a (possibly learned) function that outputs a scalar reward.
