Neural Network Quantization
This article is a short summary of quantization techniques.
- What is quantization?
- Fake Quant
- Min & Max
- Quantization for LSTM/RNN/GRU
What is quantization?
Quantization maps values from a 32-bit floating-point representation to an 8-bit fixed-point representation.
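As a rough sketch of what that mapping looks like (the function names and example values below are illustrative assumptions, not taken from the article), an affine scheme stores a scale and a zero-point and rounds each float onto a 256-level grid:

```python
import numpy as np

def quantize(x, x_min, x_max, num_bits=8):
    """Map floats in [x_min, x_max] onto an unsigned 8-bit grid."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x_max - x_min) / (qmax - qmin)        # real value of one integer step
    zero_point = int(round(qmin - x_min / scale))  # integer that represents real 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the 8-bit representation."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.array([-2.5, -0.1, 0.0, 0.7, 3.0], dtype=np.float32)
q, scale, zp = quantize(weights, weights.min(), weights.max())
print(q)                          # e.g. [  0 111 116 148 255]
print(dequantize(q, scale, zp))   # close to the original weights, up to rounding error
```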
Why Quantization?
- Arithmetic with lower bit-depth is faster
- In moving from 32 bits to 8 bits, we get an (almost) 4x reduction in memory straight away: less storage space and less bandwidth required.
- Floating-point arithmetic is not supported on some embedded devices.
Why does it work?
- First, DNNs are known to be quite robust to noise and other small perturbations once trained.
- Second, the weights and activations of a particular layer often lie in a small range, which can be estimated beforehand. This means we don't need the ability to store 10⁶ and 1/10⁶ in the same data type, allowing us to concentrate our precious few bits within a smaller range, say -3 to +3 (see the sketch below).
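To make the range argument concrete, here is a small illustrative comparison (the ranges are hypothetical, not measured from a real network) of the resolution 8 bits give over a narrow range versus a needlessly wide one:

```python
def step_size(x_min, x_max, num_bits=8):
    """Real-valued gap between adjacent quantization levels for a given range."""
    return (x_max - x_min) / (2 ** num_bits - 1)

# Concentrating the 256 levels on a realistic activation range...
print(step_size(-3.0, 3.0))   # ~0.024: fine-grained steps
# ...versus spreading them over a range we never actually need.
print(step_size(-1e6, 1e6))   # ~7843: far too coarse to represent anything useful
```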
Why still train with FP32?
Models are trained with tiny gradient updates, for which we do need high precision.
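A toy illustration of why (the scale and gradient values below are made up): with 8 bits at a realistic scale, a typical gradient step is smaller than one quantization level, so the update rounds away to nothing.

```python
scale = 6.0 / 255    # one 8-bit step for a weight range of roughly [-3, 3]
weight = 0.5
grad_step = 1e-4     # a typical small gradient update

fp32_weight = weight - grad_step                   # FP32 keeps the change
int8_before = round(weight / scale)                # quantized level before the update
int8_after = round((weight - grad_step) / scale)   # quantized level after the update

print(fp32_weight)                 # 0.4999
print(int8_before == int8_after)   # True: the quantized weight never moves
```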
Fake Quantization
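As a hedged sketch of the usual idea (illustrative only, not TensorFlow's actual implementation): fake quantization keeps weights and activations in FP32 during training but rounds them onto the 8-bit grid in the forward pass, so the network learns to tolerate the quantization error.

```python
import numpy as np

def fake_quant(x, x_min, x_max, num_bits=8):
    """Quantize onto the 8-bit grid and immediately dequantize, staying in FP32."""
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    q = np.clip(np.round((x - x_min) / scale), 0, levels)
    return q * scale + x_min   # still FP32, but restricted to 256 distinct values

x = np.array([-2.7, -0.05, 0.31, 2.9], dtype=np.float32)
print(fake_quant(x, -3.0, 3.0))   # slightly rounded copies of the inputs
```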
Quantization in TensorFlow
Quantization in TensorFlow Lite
References
- 8-Bit Quantization and TensorFlow Lite
- How to Quantize Neural Networks with TensorFlow – Pete Warden