Preference alignment in LLM training aims to improve an LLM's behaviour by steering it towards a set of rules and preferences. This could relate to suppressing offensive language or enforcing some other restriction.
Some approaches to preference alignment are detailed in this blog post from Miguel Mendez. A number of techniques are in common use, including the following (a sketch of the DPO objective follows the list):
PPO: Proximal Policy Optimization
DPO: Direct Preference Optimization
ORPO: Odds Ratio Preference Optimization (which works without a reference model)
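To make the contrast with KTO concrete later on, here is a minimal PyTorch sketch of the DPO objective. It is not any particular library's implementation; the function name and arguments are illustrative, and it assumes you have already computed the summed log-probabilities of each chosen/rejected completion under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective on a batch of preference pairs.

    Each argument is a tensor of shape (batch,) holding log pi(y|x) for the
    chosen or rejected completion under the policy or the reference model.
    """
    # Implicit rewards are the policy/reference log-ratios.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps

    # Push the chosen completion above the rejected one via a logistic loss.
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()
```

Note that DPO needs a paired "winner" and "loser" for every prompt, which is exactly the requirement KTO relaxes.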
For preference alignment we usually need data labelled as good or bad. Human annotation of such data is often expensive, and in some cases a clear "winner" between two contrasting data points cannot be decided. KTO sidesteps this: each example is labelled individually as desirable or undesirable rather than in pairs, so two answers to the same prompt can both be regarded as good. This is arguably closer to reality.
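A rough illustration of what such unpaired data could look like is below. The field names (prompt, completion, label) follow the unpaired format used by some training libraries (e.g. TRL's KTOTrainer), but treat them as an assumption rather than a required schema.

```python
# Unpaired KTO-style examples: each completion carries its own
# desirable/undesirable label, so two answers to the same prompt
# can both be labelled as good and no paired "loser" is needed.
kto_examples = [
    {"prompt": "How do I reset my password?",
     "completion": "Go to Settings > Account > Reset password.",
     "label": True},   # desirable
    {"prompt": "How do I reset my password?",
     "completion": "Click the 'Forgot password' link on the login page.",
     "label": True},   # also desirable
    {"prompt": "How do I reset my password?",
     "completion": "Just share your account with a friend instead.",
     "label": False},  # undesirable
]
```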
KTO stands for Kahneman-Tversky Optimization and is described in more detail in a blog post from contextual.ai.
The KTO research paper is worth reading to understand how the KTO loss function is constructed.
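As a rough guide to what the paper describes, the sketch below shows the shape of the KTO loss in PyTorch. It is a simplification, not the authors' implementation: in particular the reference point z0 is approximated here by a detached batch statistic, whereas the paper estimates it as a KL divergence over mismatched prompt/completion pairs. Function and argument names are illustrative.

```python
import torch

def kto_loss(policy_logps, ref_logps, labels, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """Sketch of the KTO loss on a batch of unpaired examples.

    policy_logps: log pi_theta(y|x) per example, shape (batch,)
    ref_logps:    log pi_ref(y|x) per example, shape (batch,)
    labels:       bool tensor, True = desirable, False = undesirable
    """
    # Implied reward: log-ratio of policy to reference model.
    rewards = policy_logps - ref_logps

    # Reference point z0. The paper estimates this as a KL term over
    # mismatched pairs, clamped at zero and not backpropagated through;
    # here it is crudely approximated by the detached batch mean.
    z0 = rewards.detach().mean().clamp(min=0)

    # Prospect-theory-style value: desirable completions are rewarded for
    # sitting above the reference point, undesirable ones for sitting below.
    desirable = torch.sigmoid(beta * (rewards - z0))
    undesirable = torch.sigmoid(beta * (z0 - rewards))
    values = torch.where(labels, lambda_d * desirable, lambda_u * undesirable)
    lambdas = torch.where(labels,
                          torch.full_like(values, lambda_d),
                          torch.full_like(values, lambda_u))

    # Loss is lambda_y minus the value, averaged over the batch.
    return (lambdas - values).mean()
```

Because each example contributes on its own, the batch can mix any proportion of desirable and undesirable completions; the lambda weights let you rebalance the two classes when the data is skewed.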