A modest smartphone now packs 10–30 teraflops of compute power, so how many AI models might be running simultaneously on a device at any one time? Especially considering that the latest versions of Android and iOS are jam-packed with AI. How does this influence AI product development 😮

<aside> 💡 How do we build and deploy ML on-device efficiently? (Reduce Latency, Improve Privacy, Improve Efficiency)

</aside>

Step 1: Model Conversion. Convert the trained PyTorch/TensorFlow model into a format compatible with the on-device runtime. The model is frozen into a neural network graph, which is then compiled into an executable for the target device.

Step 2: Model Optimisation. Smartphones and edge devices contain a mix of processing units, such as CPUs, GPUs and NPUs (Neural Processing Units), and we need to know exactly which units are available to optimise the model for them. This becomes complex when you deploy an app across 300+ smartphone models.
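A hypothetical sketch of the runtime dispatch this implies: probe what processing units the device exposes, then pick the fastest supported backend. `detect_units` is an assumed stub; a real app would query the vendor SDK or the runtime's delegate list.

```python
# Preference order for NN inference, fastest to slowest on typical hardware
PREFERENCE = ["npu", "gpu", "cpu"]

def detect_units():
    # Assumed stub: a real implementation probes the device / vendor SDK.
    return {"cpu", "gpu"}

def pick_backend(available, preference=PREFERENCE):
    # Walk the preference list and take the first unit this device has
    for unit in preference:
        if unit in available:
            return unit
    raise RuntimeError("no supported processing unit found")

backend = pick_backend(detect_units())  # "gpu" for the stub above
```

With 300+ device models, the hard part is not this dispatch logic but maintaining per-backend model variants and fallbacks behind it.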

To deliver consistent performance, it is important to validate on-device numerical correctness across a broad range of devices.
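One way to frame that validation, as a sketch: run the same inputs through the reference (server-side) model and the device runtime, and check the worst-case output difference against a tolerance budget. The two lambdas below are assumed stand-ins for real model runtimes.

```python
def max_abs_diff(ref_outputs, device_outputs):
    # Worst-case elementwise deviation between the two runtimes
    return max(abs(r - d) for r, d in zip(ref_outputs, device_outputs))

# Assumed stand-ins: reference float model vs slightly-off device model
reference_model = lambda xs: [2.0 * x + 1.0 for x in xs]
device_model    = lambda xs: [2.0 * x + 1.0 + 1e-4 for x in xs]

inputs = [0.0, 0.5, -1.25]
diff = max_abs_diff(reference_model(inputs), device_model(inputs))
passed = diff <= 1e-3  # per-device tolerance budget
```

In practice this runs per device model in a device farm, since quantisation and vendor kernels shift the numerics differently on each chip.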

Step 3: Quantisation. Shrinks the model (4x or more) and speeds up inference. Here we can explore weight vs activation quantisation, and post-training quantisation vs quantisation-aware training.
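A minimal sketch of where the 4x comes from, assuming per-tensor asymmetric post-training quantisation: map float32 weights (4 bytes each) to uint8 (1 byte each) via a scale and zero point. Toy weights; real flows use the runtime's quantisation tooling.

```python
def quantise(weights, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1          # uint8 range: 0..255
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / (qmax - qmin)    # float step per integer level
    zero_point = round(qmin - w_min / scale)   # integer that maps back to 0.0
    q = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantise(q, scale, zero_point):
    # Recover approximate floats for accuracy checks
    return [(v - zero_point) * scale for v in q]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]     # toy float32 weights
q, scale, zp = quantise(weights)
recovered = dequantise(q, scale, zp)
# float32 (4 bytes) -> uint8 (1 byte): the ~4x size reduction
```

Quantisation-aware training takes the same arithmetic but simulates it during training so the model learns to tolerate the rounding.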

Step 4: Integrating the model with the on-device app. Some important considerations:

  1. Model packaging (packing the model for the highest computational efficiency)
  2. Libraries: break down dependencies for optimal CPU/GPU/NPU delivery
  3. Application size: bundled with the app vs delivered over the air
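On point 3, a hypothetical sketch of the bundled-vs-OTA trade-off: small models ship inside the app binary for offline-first startup, large ones are fetched after install to keep the APK/IPA lean. The 20 MB budget is an assumed product constraint, not a platform rule.

```python
BUNDLE_BUDGET_MB = 20  # assumed app-size budget for embedded models

def delivery_strategy(model_size_mb, budget_mb=BUNDLE_BUDGET_MB):
    # Bundle if it fits the budget, otherwise download over the air
    return "bundled" if model_size_mb <= budget_mb else "over-the-air"

strategies = {size: delivery_strategy(size) for size in (5, 18, 120)}
```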

Deep Dive: link