[RFC] Liger FlexChunkLoss: Alignment and Distillation loss #371
Comments
take DPO
I can take fused linear KL div. BTW, really nice illustration on the chunk linear op fusion from the paper. Very clear to new contributors 😄
@shivam15s @ByronHsu I think we should also consider including some of the loss functions commonly used for training embedding models, especially the popular ones supported in Sentence Transformers. It's quite common for embedding models to require large batch sizes to be trained well. Coupled with the fact that their batch/input structure is similar to RLHF, where we have positive and negative pairs, I believe this can prove to be useful. I'd recommend supporting
@pramodith that is a good idea! Do you know if embedding models also have a large vocab and suffer from the memory bottleneck?
@ByronHsu most embedding models have a final Linear layer of shape (hidden_dim, hidden_dim), so vocab size doesn't really come into the picture for them; you're right to point that out. However, it is common to have an effective batch size of 65k.
Then I think chunked loss is still helpful given the large batch size.
Yes, I think so too. I can give this a try after we wrap up all the important RLHF and distillation losses. I'll also get Tom Aarsen's perspective since he's the lead of Sentence Transformers.
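For reference, a minimal sketch of how chunking could apply to an in-batch-negatives contrastive loss of the kind used in Sentence Transformers. The function name, `chunk_size`, and `scale` below are illustrative placeholders, not a proposed API:

```python
import torch
import torch.nn.functional as F

def chunked_in_batch_negatives_loss(query_emb, cand_emb, chunk_size=4096, scale=20.0):
    """query_emb, cand_emb: (B, D); the positive for query i is cand_emb[i]."""
    B = query_emb.shape[0]
    labels = torch.arange(B, device=query_emb.device)
    total = query_emb.new_zeros(())
    for start in range(0, B, chunk_size):
        end = min(start + chunk_size, B)
        # compute a (chunk, B) similarity block instead of the full (B, B) matrix
        sims = scale * query_emb[start:end] @ cand_emb.T
        total = total + F.cross_entropy(sims, labels[start:end], reduction="sum")
    return total / B
```

Chunking over query rows is valid here because the loss is a sum of per-query cross-entropy terms, while all candidate embeddings remain available to each chunk.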
## Summary

Add support for a fused, torch-compiled, and chunked DPO ([Direct Preference Optimization](https://arxiv.org/html/2305.18290v3)) loss kernel, as requested in #371. This implementation is largely based on the excellent work done on ORPO (#362) by @shivam15s.

### DPO Loss Formulation

In a reference setting (not reference-free):

$$r_\theta(x,y_c) - r_\theta(x,y_r) = \log(\pi_\theta(y_c|x)) - \log(\pi_\theta(y_r|x))$$

$$-\log(\sigma((\log(\pi_\theta(y_c|x)) - \log(\pi_\theta(y_r|x)) - \log(\pi_{\theta_{\text{ref}}}(y_c|x)) + \log(\pi_{\theta_{\text{ref}}}(y_r|x)))/\beta))$$

Corresponds to:

```python
# Policy model log probabilities
policy_chosen_logps = log_probs(policy_chosen_logits)
policy_rejected_logps = log_probs(policy_rejected_logits)

# Reference model log probabilities
ref_chosen_logps = log_probs(ref_chosen_logits)
ref_rejected_logps = log_probs(ref_rejected_logits)

# Compute advantages
chosen_advantages = policy_chosen_logps - ref_chosen_logps
rejected_advantages = policy_rejected_logps - ref_rejected_logps

# DPO loss
logits_diff = (chosen_advantages - rejected_advantages) / beta
losses = -F.logsigmoid(logits_diff)
```

In this PR:

1. The above equation shows that to maximize the reward difference, we get the formula:

   $$r_\theta(x_c) - r_\theta(x_r)$$

2. This can be further optimized using just:

   $$-\log(\sigma((\pi_\theta(x_c) - \pi_\theta(x_r))/\beta))$$

3. So, the code implements:

   ```python
   logits_diff = (chosen_logps - rejected_logps) / beta  # (π_θ(x_c) - π_θ(x_r)) / β
   losses = -F.logsigmoid(logits_diff)                   # -log(σ(logits_diff))
   ```

4. Sum up DPO and NLL:

   $$L_{DPO+NLL} = L_{DPO} + \alpha L_{NLL}$$

## Testing Done

![dpo_loss_memory](https://github.com/user-attachments/assets/d48965a2-bab7-4a81-9872-a43826106731)
![dpo_loss_speed](https://github.com/user-attachments/assets/10ab33c3-a905-435f-886b-67c911b8fff6)

- Hardware Type: **NVIDIA L40S (48G)**
- [X] run `make test` to ensure correctness
- [X] run `make checkstyle` to ensure code style
- [X] run `make test-convergence` to ensure convergence

---------

Signed-off-by: Austin Liu <[email protected]>
Co-authored-by: shivam15s <[email protected]>
#take SimPO and IRPO since they are just extensions of CPO.
🚀 The feature, motivation and pitch
We want to support various alignment and distillation loss functions.
Refer to this PR on ORPO: #362
Progress
Alignment
Distillation
Design
Approach Overview:
The core idea is to extend the methods used in chunked Fused Linear Cross Entropy (FLCE) to various alignment algorithms. Here's how the process is structured:
By combining these strategies, we efficiently optimize alignment algorithms while also simplifying development.
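As a rough illustration of the idea (not the actual implementation), here is a minimal sketch of chunking the final linear projection together with a generic per-chunk loss, so the full (tokens, vocab) logits tensor is never materialized; `loss_fn` and `chunk_size` are placeholders:

```python
import torch

def chunked_linear_loss(hidden, lm_head_weight, target, loss_fn, chunk_size=1024):
    """hidden: (N, H), lm_head_weight: (V, H), target: (N,)."""
    total = hidden.new_zeros(())
    n = hidden.shape[0]
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        logits_chunk = hidden[start:end] @ lm_head_weight.T  # only (chunk, V) lives in memory
        total = total + loss_fn(logits_chunk, target[start:end])
    return total / n
```

For alignment objectives, the per-chunk function would take chosen/rejected chunks rather than a single target, but the chunking and accumulation pattern stays the same.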
Key Findings
By leveraging torch.compile alongside optimization techniques like chunking and online softmax, we observed performance close to custom Triton kernels with reduced development time. This is why we want to introduce torch.compile as a key component of Liger.
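As a small sketch of that combination (the per-chunk function below is a placeholder, not Liger's implementation), the per-chunk computation can simply be wrapped with `torch.compile` so the projection, log-softmax, and reduction get fused:

```python
import torch
import torch.nn.functional as F

def chunk_loss(hidden_chunk, lm_head_weight, target_chunk):
    logits = hidden_chunk @ lm_head_weight.T
    return F.cross_entropy(logits, target_chunk, reduction="sum")

# torch.compile fuses the matmul, log-softmax, and loss into fewer kernels
compiled_chunk_loss = torch.compile(chunk_loss)
```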
References:
Interface
Have a base class `FlexChunkLoss` that handles chunking, accumulation, and compiling strategies. A custom loss class wraps the `FlexChunkLoss` and implements the loss fn that operates on a given chunk (see the sketch below).
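A minimal sketch of what that interface could look like; the names and signatures below are illustrative assumptions, not the final API:

```python
import torch
import torch.nn.functional as F

class FlexChunkLoss:
    """Handles chunking, accumulation, and (optionally) torch.compile of a per-chunk loss fn."""

    def __init__(self, chunk_loss_fn, chunk_size=1024, compiled=True):
        self.chunk_loss_fn = torch.compile(chunk_loss_fn) if compiled else chunk_loss_fn
        self.chunk_size = chunk_size

    def __call__(self, hidden, weight, target):
        total = hidden.new_zeros(())
        n = hidden.shape[0]
        for start in range(0, n, self.chunk_size):
            sl = slice(start, min(start + self.chunk_size, n))
            total = total + self.chunk_loss_fn(hidden[sl], weight, target[sl])
        return total / n


class ChunkedCrossEntropyLoss:
    """Example custom loss that wraps FlexChunkLoss and defines the per-chunk loss."""

    def __init__(self, chunk_size=1024):
        self.flex = FlexChunkLoss(self._chunk_loss, chunk_size=chunk_size)

    @staticmethod
    def _chunk_loss(hidden_chunk, weight, target_chunk):
        logits = hidden_chunk @ weight.T  # project only this chunk to vocab space
        return F.cross_entropy(logits, target_chunk, reduction="sum")

    def __call__(self, hidden, weight, target):
        return self.flex(hidden, weight, target)
```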
Alternatives
No response
Additional context
No response