Model Compression
The process of making a model smaller is called model compression.
While many new techniques are being developed, the four types most often used are as follows:
- Low-Rank Factorization
- Knowledge Distillation
- Pruning
- Quantization
Let’s discuss each of these techniques in detail:
1. Low-Rank Factorization: The key idea behind low-rank factorization is to replace high-dimensional tensors with lower-dimensional tensors.
2. Knowledge Distillation: It is a method in which a small model (the student) is trained to mimic a larger model or an ensemble of models (the teacher). The small model is what you’ll deploy.
3. Pruning: Pruning was a method originally used for decision trees, where you remove sections of a tree that are noncritical and redundant for classification. As neural networks gained wider adoption, people started to realize that neural networks are over-parameterized and began to find ways to reduce the workload caused by the extra parameters.
4. Quantization: It is the most general and commonly used model compression method. It’s straightforward to do and generalizes over tasks and architectures. Quantization reduces a model’s size by using fewer bits to represent its parameters.
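To make the first technique concrete, here is a minimal sketch of low-rank factorization using NumPy: a dense weight matrix is approximated by the product of two thin matrices obtained from a truncated SVD. The matrix `W` and the rank `k` are illustrative choices, not from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))  # hypothetical dense weight matrix

# Truncated SVD: keep only the top-k singular values and vectors.
k = 32
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]  # shape (256, k)
B = Vt[:k, :]         # shape (k, 256)

W_approx = A @ B  # rank-k approximation of W

original_params = W.size          # 256 * 256 = 65536
factored_params = A.size + B.size  # 256*32 + 32*256 = 16384
```

Storing `A` and `B` instead of `W` cuts the parameter count by 4x here; at inference time, `x @ W` is replaced by `(x @ A) @ B`.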
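A simple modern instance of pruning is magnitude pruning: zero out the weights with the smallest absolute values, on the assumption that they contribute least to the output. The sketch below is illustrative; `magnitude_prune` is a hypothetical helper, not a library function.

```python
import numpy as np

def magnitude_prune(W, sparsity=0.9):
    # Zero out the `sparsity` fraction of weights with smallest magnitude.
    k = int(W.size * sparsity)
    threshold = np.partition(np.abs(W).ravel(), k)[k]
    mask = np.abs(W) >= threshold  # keep only the largest-magnitude weights
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 100))
pruned, mask = magnitude_prune(W, sparsity=0.9)
```

The resulting sparse matrix can be stored in a compressed format; realizing actual speedups also requires hardware or kernels that exploit the sparsity.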
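Finally, a minimal sketch of quantization: mapping 32-bit floating-point weights to 8-bit integers plus a single scale factor, which cuts storage by 4x. This is symmetric linear quantization under assumed illustrative names; real frameworks add per-channel scales, zero points, and calibration.

```python
import numpy as np

def quantize_int8(W):
    # Symmetric linear quantization: map [-max|W|, +max|W|] onto [-127, 127].
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the int8 codes.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)
```

Each weight now occupies 1 byte instead of 4, and the rounding error per weight is at most half the quantization step (`scale / 2`).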