A modest smartphone now has 10–30 teraflops of compute power. How many AI models might be running simultaneously on a device at any one time, especially considering that the latest versions of Android and iOS are jam-packed with AI? How does this influence AI product development? 😮
<aside> 💡 How do we build and deploy ML models on-device efficiently? (Reduce latency, improve privacy, improve efficiency.)
</aside>
Step 1: Model Conversion. Convert the trained model from a framework like PyTorch or TensorFlow to a format compatible with the on-device runtime. The model is frozen into a static neural-network graph, which is then compiled into an executable for the device.
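A minimal sketch of the conversion step on the PyTorch side, assuming TorchScript as the on-device format. `TinyNet` is a hypothetical stand-in for the real network; the point is tracing the model into a frozen static graph that a mobile runtime can load.

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the real network.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyNet().eval()          # eval mode: freeze batchnorm/dropout behaviour
example = torch.randn(1, 4)       # example input fixes the traced graph's shapes

# Trace the model into a static TorchScript graph, then freeze it
# (inlines weights and strips training-only attributes).
frozen = torch.jit.trace(model, example)
frozen = torch.jit.freeze(frozen)

# frozen.save("tinynet.pt")  # this artifact is what ships with the app
```

Tracing records one concrete execution path, so models with data-dependent control flow would need `torch.jit.script` instead.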
Step 2: Model Optimisation. Smartphones and edge devices contain a mix of processing units, such as CPUs, GPUs, and NPUs (Neural Processing Units); we need to know exactly which units are available to optimise the model well. This becomes complex when an app must run across 300+ smartphone types.
To keep performance consistent, validate the model's on-device numerical correctness across a broad range of devices.
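The numerical-correctness check above can be sketched as a simple tolerance comparison between reference (desktop) outputs and outputs captured on each device. The tolerances here are illustrative, not a standard; different accelerators legitimately differ in low-order bits.

```python
import numpy as np

def validate_outputs(reference, on_device, rtol=1e-3, atol=1e-5):
    """Compare reference (desktop) outputs against on-device outputs.

    Returns (ok, max_abs_err): whether every element is within tolerance,
    and the worst absolute deviation for reporting.
    """
    reference = np.asarray(reference, dtype=np.float32)
    on_device = np.asarray(on_device, dtype=np.float32)
    max_abs_err = float(np.max(np.abs(reference - on_device)))
    ok = bool(np.allclose(reference, on_device, rtol=rtol, atol=atol))
    return ok, max_abs_err

# Usage: run the same inputs through both backends and compare.
ok, err = validate_outputs([0.12, 0.34], [0.12, 0.34])
```

In practice this runs per device model in a device farm, with tolerances chosen per layer or per output head.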
Step 3: Quantisation. Reduce model size (often 4x or more, e.g. float32 → int8) and speed up inference. Key design choices here are weight vs. activation quantisation, and post-training quantisation vs. quantisation-aware training.
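A minimal sketch of where the 4x figure comes from: symmetric per-tensor int8 weight quantisation, storing each float32 weight as one int8 value plus a shared scale. This is a hand-rolled illustration, not a framework API; real toolchains (PyTorch quantisation, TFLite converter) handle this per layer.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantisation: float32 -> (int8 values, scale)."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights for reference/debugging."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in weight tensor
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than float32.
print(w.nbytes // q.nbytes)  # -> 4
```

The round-trip error is bounded by half the quantisation step (`0.5 * scale`), which is the error that quantisation-aware training teaches the model to tolerate.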
Step 4: App Integration. When integrating the model with the on-device app, some important considerations are:
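One consideration called out at the top of this note is latency. A minimal sketch of how inference latency is typically measured before integration: warm-up runs first (first calls pay one-off allocation costs), then report the median rather than the mean, which is skewed by outliers. The one-layer model is a hypothetical stand-in; on a real device this loop runs through the mobile runtime, not Python.

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 2)).eval()  # stand-in for the real model
x = torch.randn(1, 4)

times = []
with torch.no_grad():
    # Warm up: first calls include one-off setup costs we don't want to count.
    for _ in range(10):
        model(x)
    # Timed runs.
    for _ in range(50):
        t0 = time.perf_counter()
        model(x)
        times.append(time.perf_counter() - t0)

times.sort()
median_s = times[len(times) // 2]
print(f"median latency: {median_s * 1e3:.3f} ms")
```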
Deep Dive: link