This example demonstrates how to chain multiple optimization techniques, namely pruning, distillation, and quantization, to achieve the best performance on a given model. It compresses a Hugging Face BERT large model for Question Answering using a combination of `modelopt.torch.prune`, `modelopt.torch.distill`, and `modelopt.torch.quantization`. More specifically, we will:
- Prune the BERT large model to 50% FLOPs with the GradNAS algorithm and fine-tune it with distillation (see the first sketch after this list)
- Quantize the fine-tuned model to INT8 precision with Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) with distillation (see the second sketch below)
- Export the quantized model to ONNX format for deployment with TensorRT (see the export sketch below)
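As a rough illustration of the first step, here is a minimal sketch of GradNAS pruning followed by distillation fine-tuning. The checkpoint name, `train_loader`, and `loss_func` are placeholders assumed for illustration; the real data pipeline and flags live in `bert_prune_distill_quantize.py`.

```python
import torch
from transformers import BertForQuestionAnswering

import modelopt.torch.distill as mtd
import modelopt.torch.prune as mtp

# Teacher is the original BERT large QA model; the student starts as a copy
# that will be pruned. Checkpoint name is a placeholder.
teacher = BertForQuestionAnswering.from_pretrained("bert-large-uncased")
student = BertForQuestionAnswering.from_pretrained("bert-large-uncased")

train_loader = ...  # assumed: DataLoader over tokenized SQuAD batches
loss_func = lambda output, batch: output.loss  # assumed loss callable for GradNAS
dummy_input = torch.ones(1, 384, dtype=torch.long)  # example input_ids for tracing

# GradNAS scores prunable choices using gradient information from a few
# batches, then searches for a sub-network that meets the FLOPs constraint.
pruned_student, prune_results = mtp.prune(
    model=student,
    mode="gradnas",
    constraints={"flops": "50%"},  # keep ~50% of the original FLOPs
    dummy_input=dummy_input,
    config={"data_loader": train_loader, "loss_func": loss_func},
)

# Wrap the pruned student with the frozen teacher for distillation fine-tuning.
kd_config = {
    "teacher_model": teacher,
    "criterion": mtd.LogitsDistillationLoss(),
    "loss_balancer": mtd.StaticLossBalancer(),
}
distill_model = mtd.convert(pruned_student, mode=[("kd_loss", kd_config)])
# ... run the usual fine-tuning loop on `distill_model` ...
```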
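The second step could look roughly like the following. `mtq.quantize` with a config such as `mtq.INT8_DEFAULT_CFG` and a calibration `forward_loop` is the standard modelopt PTQ entry point; the `calib_loader` and the model variable are assumptions carried over from the sketch above.

```python
import modelopt.torch.quantization as mtq

calib_loader = ...  # assumed: small DataLoader of tokenized calibration samples


def forward_loop(model):
    """PTQ calibration: run a few batches so the inserted quantizers
    can collect activation statistics."""
    for batch in calib_loader:
        model(**batch)


# `pruned_student` is the fine-tuned student from step 1 (assumed).
quant_model = mtq.quantize(pruned_student, mtq.INT8_DEFAULT_CFG, forward_loop)

# QAT: the returned model is still a regular trainable torch.nn.Module, so a
# short distillation fine-tuning run (as in step 1) can recover the accuracy
# lost to INT8 quantization.
```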
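For the final step, a plain `torch.onnx.export` call is assumed here as a sketch; the example's actual export code may differ. Input names, shapes, and the output filename are illustrative.

```python
import torch

# Export the fake-quantized model; the quantizers become ONNX Q/DQ nodes
# that TensorRT can fuse into INT8 kernels.
quant_model.eval()
seq_len = 384
input_ids = torch.ones(1, seq_len, dtype=torch.long)
attention_mask = torch.ones(1, seq_len, dtype=torch.long)
token_type_ids = torch.zeros(1, seq_len, dtype=torch.long)

torch.onnx.export(
    quant_model,
    (input_ids, attention_mask, token_type_ids),
    "bert_large_qa_int8.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["start_logits", "end_logits"],
    dynamic_axes={
        name: {0: "batch"}
        for name in ["input_ids", "attention_mask", "token_type_ids"]
    },
    opset_version=13,  # per-channel Q/DQ ops require opset >= 13
)
```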
The main Python file is `bert_prune_distill_quantize.py`, and scripts for running all 3 steps are available in the `scripts` directory. More details on this example (including highlighted code snippets) can be found in the Model Optimizer documentation.