Adds the CPO Alignment Loss Function #382
Open
Summary
CPO is almost identical to DPO; the major difference is that the reference model in CPO is assumed to be a uniform distribution. Under this assumption, all terms involving the reference model cancel. This corresponds to equation 3 in the paper. Additionally, CPO introduces a scaling factor alpha for the NLL loss on the preferred response. In TRL, this corresponds to the CPOTrainer with loss_type="sigmoid".
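The resulting loss can be sketched as follows. This is a minimal, per-pair scalar illustration of the formula described above, not the chunked implementation in this PR; the function name and default hyperparameters are assumptions.

```python
import math

def cpo_loss(logp_chosen, logp_rejected, beta=0.1, alpha=1.0):
    """CPO loss for a single preference pair (scalar sketch).

    With a uniform reference model, the DPO reference log-prob terms
    cancel, leaving only the policy log-probabilities (eq. 3 in the
    CPO paper), plus an alpha-scaled NLL term on the chosen response.
    """
    margin = beta * (logp_chosen - logp_rejected)
    # sigmoid preference loss: -log(sigmoid(margin)) = log(1 + exp(-margin))
    preference_loss = math.log(1.0 + math.exp(-margin))
    # NLL of the preferred response, scaled by alpha
    nll_loss = -logp_chosen
    return preference_loss + alpha * nll_loss
```

With alpha=0 this reduces to the pure sigmoid preference loss, and the loss shrinks as the margin between chosen and rejected log-probs grows.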
We also refactor the test cases for chunked loss functions to include a generic HFAlignmentLoss base class that takes care of some of the plumbing work: correctly generating batches of input, calculating the NLL loss, etc. Future test cases can inherit from this class and implement only the alignment_loss function to compare the TRL implementation against the custom one.

Testing Done
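The base-class pattern used by the tests can be sketched like this. All names and signatures here are illustrative assumptions, not the actual classes in this PR; the point is that a subclass only supplies alignment_loss while the base class handles batching and the NLL term.

```python
import math

class HFAlignmentLoss:
    """Illustrative test base class (names/signatures are assumptions).

    Subclasses implement `alignment_loss` on per-pair log-probs; the
    base class iterates over the batch and adds the alpha-scaled NLL
    term on the chosen response.
    """

    def __init__(self, beta=0.1, alpha=1.0):
        self.beta = beta
        self.alpha = alpha

    def alignment_loss(self, logp_chosen, logp_rejected):
        raise NotImplementedError

    def __call__(self, batch):
        # batch: list of (logp_chosen, logp_rejected) pairs
        total = 0.0
        for logp_c, logp_r in batch:
            total += self.alignment_loss(logp_c, logp_r) - self.alpha * logp_c
        return total / len(batch)

class CPOLoss(HFAlignmentLoss):
    """CPO: sigmoid preference loss with a uniform reference model."""

    def alignment_loss(self, logp_chosen, logp_rejected):
        margin = self.beta * (logp_chosen - logp_rejected)
        return math.log(1.0 + math.exp(-margin))
```

A test would then compute the same batch through the TRL reference and through the custom chunked kernel, asserting the outputs match.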
A100-80G-SXM
Benchmark Results:
- make test to ensure correctness
- make checkstyle to ensure code style
- make test-convergence to ensure convergence