When deploying real-time models to Qwak, we receive many questions regarding performance and optimization:
- How should I configure my real-time inference endpoint during a deployment?
- Which configuration options matter the most?
- How many workers should I choose for the best scaling?
This document summarizes some of the common issues and topics to help you answer these pressing questions.
- Number of replicas is the number of instances deployed in Kubernetes. The load balancer splits the traffic between the replicas: the bigger the number, the more live replicas are deployed.
- Instance size determines the number of vCPUs, the RAM, and the GPU specifications of each replica.
- Number of workers determines the number of forked processes within each replica.
- Maximal batch size is the maximum number of rows in the DataFrame received in the model's predict function.
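Putting the four knobs together: the replicas multiplied by the workers per replica give a rough upper bound on how many requests the deployment can process concurrently. A minimal sketch of that arithmetic; the configuration keys and numbers here are illustrative, not Qwak's actual API:

```python
# Illustrative deployment configuration; the key names are
# hypothetical and not part of Qwak's API.
config = {
    "replicas": 3,         # Kubernetes pods behind the load balancer
    "instance_vcpus": 4,   # vCPUs per replica (instance size)
    "workers": 8,          # forked worker processes per replica
    "max_batch_size": 64,  # max rows in the DataFrame passed to predict()
}

# Rough upper bound on requests handled in parallel across the deployment.
concurrent_capacity = config["replicas"] * config["workers"]
print(concurrent_capacity)  # 24
```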
Increase the number of replicas:
- When modifying the pod configuration does not help when traffic increases.
- When you prefer using a large number of cheaper pods.
- When expecting a spike in traffic, and you want to increase capacity temporarily.
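Because the load balancer splits traffic evenly, each replica sees roughly the total request rate divided by the replica count, which is why adding replicas is the first lever for absorbing more traffic. A small sketch with made-up numbers:

```python
def per_replica_rps(total_rps: float, replicas: int) -> float:
    """The load balancer splits traffic evenly across replicas,
    so each replica sees roughly total_rps / replicas."""
    return total_rps / replicas

# With 900 requests/second spread over 3 replicas,
# each replica handles about 300 requests/second.
print(per_replica_rps(900, 3))  # 300.0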
Increase the instance size if you want to use more workers and handle multiple requests in parallel.
Don't waste vCPUs!
If you increase the number of vCPUs without increasing the number of workers, you will waste resources: the additional vCPUs won't be used.
In general, ML inference is a CPU-bound process, so we should follow the rule of having 1 vCPU per two worker processes. Of course, if you run a simple model, you may try increasing the number of workers per vCPU.
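The rule of thumb above (two worker processes per vCPU for CPU-bound inference) can be captured in a tiny helper; the function name and the per-vCPU ratio are illustrative, and the ratio can be raised for simple models:

```python
def recommended_workers(vcpus: int, workers_per_vcpu: int = 2) -> int:
    """Suggest a worker count following the rule of thumb of
    two worker processes per vCPU for CPU-bound inference.

    For simple, fast models you may try a higher workers_per_vcpu.
    """
    return vcpus * workers_per_vcpu

print(recommended_workers(2))  # 4
print(recommended_workers(4))  # 8
```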
Increase the number of workers if you need to handle more traffic and your pods still have some unused CPU capacity and RAM.
- When increasing the number of workers on each pod, keep an eye on memory. Every worker runs as a separate forked process, so there is no shared memory: each worker has to load its own copy of the inference service and the model.
- When you have increased the max batch size per prediction request, you constantly send enough data to fill the entire batch, and your CPUs can no longer keep up.
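Since forked workers share no memory, RAM usage grows linearly with the worker count: each worker holds its own copy of the model and the inference service. A rough sizing sketch; the function name and the example sizes are made up for illustration:

```python
def replica_memory_gb(workers: int, model_gb: float,
                      service_gb: float = 0.5) -> float:
    """Estimate RAM needed per replica. Each forked worker loads its
    own copy of the model plus the inference service, so memory
    scales linearly with the number of workers."""
    return workers * (model_gb + service_gb)

# 4 workers, a 1.5 GB model, ~0.5 GB of service overhead per worker:
print(replica_memory_gb(workers=4, model_gb=1.5))  # 8.0
```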
Don't waste GPUs!
Don't deploy a GPU instance if you process requests one by one! GPUs exist to parallelize the computation. When you process a batch of size 1, a GPU won't give you any performance improvements.
Use a GPU instance only if:
- Your code in the predict function and the model can handle more than one value at a time (preferably without iterating over the rows in the predict function).
- You can group requests into batches (you have enough data to send, and the client application can handle that).
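Grouping requests into batches is the key to getting value from a GPU: the device parallelizes across the rows of a batch, so size-1 calls leave it idle. A minimal client-side batching sketch; the helper name is illustrative:

```python
from typing import Iterator, List


def batched(items: List[int], batch_size: int) -> Iterator[List[int]]:
    """Group individual requests into fixed-size batches so the GPU
    can parallelize across the whole batch instead of size-1 calls."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


requests = list(range(10))
print(list(batched(requests, 4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```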