Direct Nash Optimization Beats Bigger Models with Better Data

Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];
(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].
Table of Links
Abstract and 1 Introduction
2 Preliminaries
2.1 RLHF Based on Reward Models
2.2 RLHF with General Preferences
3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1
3.2 Theoretical Analysis
4 Practical Algorithm - Iterative Contrastive Self-Improvement
5 Experiments and 5.1 Experimental Setup
5.2 Results and Analysis
6 Related Work
7 Conclusion and References
Appendix
A Extension to Regularized Preferences
B Detailed Proofs
C Additional Experimental Details
5.2 Results and Analysis
We run several head-to-head experiments that control for the input prompts and input data. We often refer to the policy being trained as the "student" and GPT-4 as the "teacher"; GPT-4 also acts as the annotator when judging preferences.
SFT baselines: The first baseline is Orca-2.5 itself, i.e. the mistralai/Mistral-7B-v0.1 raw model fine-tuned on a new collection of Orca-2 data (Mitra et al., 2023). This model was fine-tuned for three epochs and achieves the scores shown at the top of Table 4. All other experiments in this study are initialized from epoch 1 of Orca-2.5. This is the solid horizontal line in Figure 2.
The second baseline continues training Orca-2.5 with SFT on the positives in UltraFeedback (masking out the loss on the input prompts). If the original positive in this dataset was not generated by GPT-4-Turbo, we replace it with one that is. This is the red line in Figure 2. It is clear that even offline contrastive training methods are more beneficial than additional SFT, indicating that the contrast between positive and negative outputs provides a more valuable training signal than the positives in isolation.
Filtering for large-margin training pairs: We ran a simple single-epoch offline DPO ablation on the UltraFeedback data. In the control, we trained on all 63k preference pairs in the original dataset, whereas in the treatment we filtered down to the 42k pairs meeting a large-margin requirement that the positive's score exceed the negative's by at least 1.0 (out of 10) according to their GPT-4-Turbo annotations. All else being equal, even though the treatment was trained for fewer steps on less data, it achieved an AlpacaEval 2.0 win rate of 11.60 versus 9.60 for the control, indicating that a smaller number of high-quality preference pairs is better than a larger quantity of noisier pairs (not shown in the tables).
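A minimal sketch of this large-margin filter, assuming pairs are stored as dictionaries with GPT-4-Turbo scores attached (the field names are illustrative, not taken from the paper):

```python
# Hypothetical sketch of the large-margin filter: keep only pairs whose
# annotator scores differ by at least the margin. Field names are assumed.
MARGIN = 1.0  # positive must beat negative by >= 1.0 out of 10

def filter_large_margin(pairs, margin=MARGIN):
    """Drop preference pairs whose score gap is below the margin."""
    return [
        p for p in pairs
        if p["score_chosen"] - p["score_rejected"] >= margin
    ]

# Example: only the first pair clears the 1.0-point margin.
pairs = [
    {"prompt": "q1", "chosen": "a1", "rejected": "b1",
     "score_chosen": 9.0, "score_rejected": 7.5},
    {"prompt": "q2", "chosen": "a2", "rejected": "b2",
     "score_chosen": 8.0, "score_rejected": 7.6},
]
train_pairs = filter_large_margin(pairs)  # -> 1 pair kept
```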
On-policy is better than off-policy: One of the critical questions in this study is whether to sample "on-policy" outputs from the current student for use in training pairs, or whether "off-policy" outputs collected from various other models different from the student will suffice. We ran 4 epochs of offline DPO on the (large-margin filtered) UltraFeedback data, and, as shown in Table 1, the on-policy DNO methods outperform off-policy offline DPO, even when the offline DPO is trained for 4 epochs while the on-policy models are given only three iterations. Recall that each iteration of on-policy training sees only one third of the input data, whereas an epoch of offline DPO sees the entire dataset.
Higher-quality annotations: In our study we use GPT-4-Turbo to annotate preference pairs, whereas Self-Rewarding Language Models uses a Llama-2-70B model as its annotator (Touvron et al., 2023). Although GPT-4-Turbo's annotations were not benchmarked against held-out human labels, we believe that starting with a higher-quality annotator leads to higher-quality policies. Since both studies use UltraFeedback inputs and differ primarily in the annotator, we believe this is a fair comparison.
We note that DNO, initialized with a 7B base model, exceeds the 70B-parameter Self-Rewarding model at the same number of training iterations (24.97 vs. 20.44 win rate on AlpacaEval 2.0, and 7.46 vs. 7.25 on MT-Bench), at least partially due to the higher quality of the preference annotations. See the dark blue bars versus the gray line in Figure 2 and the corresponding row in Table 1. However, unlike Self-Rewarding LM, we saw a slight gain rather than a decline on reasoning benchmarks such as ARC-Challenge (Clark et al., 2018). Admittedly, the OpenLLM evaluation predicts an answer by taking the max logit over the multiple-choice options, which does not align with how these techniques are trained.
Constructing training pairs: One of the most important implementation questions in this study is how to construct training pairs that help the student policy exceed a strong teacher such as GPT-4-Turbo. One method, Self-Play Fine-Tuning (SPIN), removes the preference annotation step altogether, automatically assigning the teacher output as the positive and all student samples as negatives (Chen et al., 2024). In our re-implementation we find this is harmful, most likely because the automatic assignment can produce noisy training pairs in cases where the student's output may actually be preferred. The resulting win rate for SPIN is only 16.13 after three epochs of iterative training, compared to 24.97 for DNO as shown in Table 1, all else being equal. Similar results hold in the OpenLLM results in Table 3.
In a second experiment, which we refer to as DNO-Restrictive, we annotate all preference pairs with GPT-4-Turbo as usual, but only admit training pairs in which the teacher's output is the preferred one. The difference between standard DNO and DNO-Restrictive is visible in Table 2, where 0 student-vs-teacher and student-vs-student pairs are created. DNO-Restrictive is similar in spirit to SPIN, but SPIN would admit an even greater quantity of noisy teacher-positive examples even when the preference is overturned: Table 2 shows that after iteration 2 of DNO, only 9.9k examples have the teacher preferred over the student, whereas SPIN would have automatically admitted about 100k (5 samples × 20k inputs).
Although DNO-Restrictive is slightly better (19.21 win rate) than SPIN, it still does not give the student a chance to compare its own behavior to that of a powerful teacher. The absence of this signal is a major oversight: the last row of Table 2 shows that by Iter 3, more than 64% of DNO's training data (32k pairs) are cases where the student's output is actually preferred over the teacher's, a number that increases with each iteration. We conclude that it is critical to "allow the student to become the teacher", i.e. to learn from comparisons in which its own outputs are preferred over those of a more powerful teacher.
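For concreteness, here is a hedged Python sketch contrasting the three pair-construction strategies discussed above (SPIN, DNO-Restrictive, and standard DNO). The data layout, helper names, and margin handling are assumptions for illustration, not the paper's exact implementation:

```python
def spin_pairs(prompt, teacher_output, student_samples):
    # SPIN: the teacher output is always the positive and every student
    # sample is a negative; no preference annotation is used at all.
    return [(prompt, teacher_output, s) for s in student_samples]

def dno_restrictive_pairs(prompt, scored_outputs, teacher_idx, margin=1.0):
    # DNO-Restrictive: annotate everything, but only admit pairs where the
    # teacher's output is the preferred (higher-scored) side.
    # scored_outputs is a list of (output_text, annotator_score) tuples.
    teacher_out, teacher_score = scored_outputs[teacher_idx]
    return [
        (prompt, teacher_out, out)
        for i, (out, score) in enumerate(scored_outputs)
        if i != teacher_idx and teacher_score - score >= margin
    ]

def dno_pairs(prompt, scored_outputs, margin=1.0):
    # Standard DNO: admit any large-margin pair, which allows
    # student-vs-teacher and student-vs-student comparisons where a
    # student sample is the preferred side.
    pairs = []
    for i, (winner, s_w) in enumerate(scored_outputs):
        for j, (loser, s_l) in enumerate(scored_outputs):
            if i != j and s_w - s_l >= margin:
                pairs.append((prompt, winner, loser))
    return pairs
```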
One curious phenomenon in Table 2 is that, even though the teacher's outputs are fixed ahead of time, the annotator assigns slightly lower scores to the teacher as the student improves; we are not sure whether this is an unavoidable artifact of preference annotation or a symptom of a deeper problem. Also, the total quantity of new "large margin" training pairs (those not carried over from previous iterations) in DNO shrinks as the policy improves across iterations, but we do not have enough data to quantify how this relates to changes in quality.
Lookahead to future iterations: Relatedly, we tested whether the model could benefit from knowing the training pairs it would generate if it could look into the future. We tested this by running three iterations of DNO, collecting all preference pairs across the iterations, combining and shuffling them, and then re-training from the initial model. In essence, this turns batched online DNO into an offline learning algorithm that we refer to as DNO-Lookahead. We trained for one epoch on the three iterations' worth of preference data. The AlpacaEval 2.0 win rate deteriorated more than we expected (24.97 to 18.18); more surprisingly, however, the MT-Bench numbers improved dramatically (7.48 to 7.70). While the reasons for the relatively low correlation between MT-Bench and AlpacaEval 2.0 are not entirely clear, it is important to consider the difference in the sizes of the two datasets. Given that MT-Bench consists of only 80 examples, while AlpacaEval 2.0 contains roughly 10x more, we conclude that the statistical significance and reliability of the AlpacaEval 2.0 results inspire more confidence.
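A minimal sketch of the DNO-Lookahead ablation described above, assuming hypothetical `pairs_per_iteration` data and a `train_offline` helper (neither is from the paper's codebase):

```python
import random

def dno_lookahead(initial_checkpoint, pairs_per_iteration, train_offline):
    """Merge the preference pairs produced by each online DNO iteration,
    shuffle them, and train once offline from the initial SFT model."""
    all_pairs = [p for pairs in pairs_per_iteration for p in pairs]
    random.shuffle(all_pairs)  # mix pairs across iterations
    # A single offline epoch over the combined data, as described above.
    return train_offline(initial_checkpoint, all_pairs, epochs=1)
```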
DNO scales with more data: One of the reasons we split UltraFeedback into three non-overlapping partitions is to avoid overfitting. Another strategy to avoid overfitting is to collect more data, so we scaled up the input data by a factor of 10 using publicly available datasets. We split a large mixture of datasets into six non-overlapping partitions of roughly 100k inputs each (and inference GPT-4-Turbo outputs for all inputs), and we show that DNO-More-Data scales well in this expanded regime (see the purple line in Figure 2 and the last row of Table 4).
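For illustration only, a sketch of the data split used for DNO-More-Data as described above: a large prompt mixture divided into six non-overlapping partitions of roughly 100k inputs, one consumed per iteration (the sizes follow the text; the slicing scheme and names are assumptions):

```python
def make_partitions(prompts, num_partitions=6, partition_size=100_000):
    """Split a shuffled prompt pool into non-overlapping per-iteration chunks."""
    return [
        prompts[i * partition_size:(i + 1) * partition_size]
        for i in range(num_partitions)
    ]
```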
We make a few observations about the behavior of this experiment. Since each iteration builds on the outputs of the previous iteration, any anomalies or errors in critical components such as preference annotation will propagate, and the only way to combat them is to "roll back" to the iteration that introduced them. This can lead to wasted time and cost, both of which are already high as shown in Appendix C. We suspect that the "depth" of iterations matters more than the "width", i.e. the number of samples in each iteration; furthermore, allotting an equal number of inputs to each iteration may not be optimal, but we did not test this thoroughly. From an efficiency standpoint, although this algorithm is "batched", some improvements could be made, such as starting to annotate sampled policy outputs as soon as they are ready instead of waiting for all inference jobs to finish.
"Exploding" lengths: Contrastive LLM training techniques, especially DPO, are known to produce longer outputs than the initial model, which is widely suspected to be a form of "reward hacking". Curiously, Table 2 shows that the largest jump occurs after the first round of contrastive training (iteration 1), where lengths explode by at least a factor of 2 over the SFT initialization, before creeping back down in subsequent iterations. We interpret this "length spike" as computation wasted on optimizing a spurious signal; we wish we were better equipped to control this phenomenon.