January 13, 2025
In less than 50 lines of code, you can deploy a Bert-like model from the Hugging Face library and achieve over 100 requests per second with latencies below 100 milliseconds for less than $250 a month.
The code for this blog post is available here: https://github.com/comet-ml/blog-serving-hugging-face-models.
Simple models and simple inference pipelines are much more likely to generate business value than complex approaches. When it comes to deploying NLP models, nothing is as simple as creating a FastAPI server to make real-time predictions.
While GPU-accelerated inference has its place, this blog post will focus on how to optimize your CPU inference service to achieve sub-100-millisecond latency and a throughput of over 100 requests per second. One key advantage of using a Python inference service rather than more complex GPU-accelerated deployment options is that we can have the tokenization built in, further reducing the complexity of the deployment.
In order to achieve good performance for CPU inference, we need to make optimisations to our serving framework. We break the post down into:
Benchmarks are notoriously difficult [1], so we highly recommend you create your own based on your specific requirements. We provide all the code used to reproduce the numbers presented below on GitHub here.
As we can’t test everything, we have had to make a number of decisions:
The code for the baseline inference service is available on GitHub here.
The baseline approach relies on the default parameters for FastAPI, PyTorch and Hugging Face. As we start optimising these libraries for our inference task, we will be able to compare the impact on the performance metrics.
Our baseline approach will use:
Thanks to the awesome work of both the Hugging Face and FastAPI teams, we can create an API in just a few lines of code:
We can then start the FastAPI server using: gunicorn main:app
Using this approach we obtain the following performance metrics:
* for this benchmark we used 2 concurrent users in the load testing software
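The idea of "concurrent users" can be illustrated with a minimal closed-loop load generator. This is only a sketch using the standard library, not the load testing software used for the benchmark: `fake_predict` is a hypothetical stand-in for an HTTP call to the inference service, and the function names and default values are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_predict():
    """Hypothetical stand-in for one HTTP request to the inference service."""
    time.sleep(0.01)


def user_loop(n_requests, call):
    """One simulated user: send requests back to back and time each one."""
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


def run_load_test(concurrent_users=2, requests_per_user=20, call=fake_predict):
    """Run all simulated users in parallel; return latencies and wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        futures = [
            pool.submit(user_loop, requests_per_user, call)
            for _ in range(concurrent_users)
        ]
        latencies_ms = [ms for f in futures for ms in f.result()]
    duration_s = time.perf_counter() - start
    return latencies_ms, duration_s


if __name__ == "__main__":
    latencies, duration = run_load_test()
    print(f"{len(latencies)} requests in {duration:.2f}s")
```

Each simulated user waits for a response before sending the next request, so with 2 users there are never more than 2 requests in flight.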
Learnings
A simple Python API can serve up to 6 predictions a second, which is over 15 million predictions a month!
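The monthly figure follows from simple arithmetic. The sketch below assumes a 30-day month and a server that sustains 6 predictions per second around the clock; both are assumptions for illustration rather than measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def predictions_per_month(predictions_per_second, days=DAYS_PER_MONTH):
    """Predictions served in a month at a constant sustained rate."""
    return predictions_per_second * SECONDS_PER_DAY * days


monthly = predictions_per_month(6)
print(f"{monthly:,} predictions per month")  # 15,552,000 predictions per month
```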
The code for the PyTorch and FastAPI optimized inference service is available on GitHub here.
In the baseline server we used the default configuration settings for both PyTorch and FastAPI. By making some small changes, we can increase throughput by 25%.
Most of these optimisations come from a really great blog post by the Roblox team on how they scaled Bert to 1 billion requests a day [4].
Changes to PyTorch configuration:
torch.set_grad_enabled(False): During inference we don’t need to compute the gradients.
torch.set_num_threads(1): We would like to configure the parallelism using Gunicorn workers rather than through PyTorch. This will maximise CPU usage.
Changes to FastAPI configuration:
gunicorn main:app --workers $NB_WORKERS: Load a new model for each worker that will each use one CPU so that we can process requests in parallel.
In order to understand the impact of these changes, we run a couple of benchmarks with the same number of concurrent users as we used for the baseline approach:
* for this benchmark we used 2 concurrent users in the load testing software
Looking at the benchmark above, we find that having the same number of workers as we have CPU cores is a good rule of thumb when configuring Gunicorn. Going forward we will be using this rule of thumb for all machine types.
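One way to apply this rule of thumb without hard-coding `$NB_WORKERS` is a Gunicorn configuration file, which is plain Python. The sketch below is an assumption about how you might set it up rather than the configuration used for the benchmark; the file name `gunicorn.conf.py` is the conventional one and can be passed with `gunicorn -c gunicorn.conf.py main:app`.

```python
# gunicorn.conf.py (illustrative sketch, not the benchmark configuration)
import os

# One worker per CPU core, following the rule of thumb above.
# os.cpu_count() can return None on some platforms, so fall back to 1.
workers = os.cpu_count() or 1
```

Note that `os.cpu_count()` reports the cores visible to the machine, which may be more than the CPU limit of a container, so you may want to set the value explicitly in that case.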
By making some small changes to the way our models are served, we have achieved a 25% increase in throughput compared to our baseline. In addition, both the median latency and the 95th percentile latency have decreased.
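For readers reproducing these metrics from their own measurements, here is a minimal sketch of how throughput, median latency and 95th percentile latency can be computed from a list of per-request latencies. The function name `summarize` and the interpolation method are assumptions; load testing tools may compute percentiles slightly differently.

```python
import statistics


def summarize(latencies_ms, duration_s):
    """Summarise one load-test run.

    latencies_ms: per-request latencies in milliseconds
    duration_s:   wall-clock duration of the whole run in seconds
    """
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "throughput_rps": len(latencies_ms) / duration_s,
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": cuts[94],
    }


if __name__ == "__main__":
    # Made-up latencies purely to exercise the function.
    sample = [float(ms) for ms in range(1, 101)]
    print(summarize(sample, duration_s=10.0))
```

The 95th percentile is worth tracking alongside the median because it shows how slow the slowest requests are, which the median hides.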
Learnings
When serving ML models, we should not be using PyTorch parallelism or FastAPI asynchronous processes and instead manage the parallelism using Gunicorn workers.
The code for the Model optimized inference service is available on GitHub here.
While Bert is a very versatile model, it is also a large model. In order to decrease latency and improve throughput there are two main strategies we can use:
While both options will improve inference latency, they will also impact the accuracy of the model. We haven’t looked into the impact on accuracy, but we can expect the drop in accuracy to be small [7].
Moving from Bert to a distilled version of Bert is very straightforward given we are using Hugging Face. All we need to do is change BertForSequenceClassification.from_pretrained('bert-base-uncased') to DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased').
When using PyTorch, quantization is very easy to implement. All we need to do is call model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8). For TensorFlow models, quantization is not as straightforward, as you have to use either TensorFlow Lite or TensorRT, which is much more temperamental. For this benchmark, we will use the PyTorch version of the model.
* for this benchmark we used 2 concurrent users in the load testing software
Learnings
Using quantization and distillation leads to a 300% increase in throughput and a comparable reduction in latency.
The code for the hardware optimized inference service is available here.
The hardware used to make the inference can also have a big impact on performance. Having more CPUs allows us to process more concurrent requests, for example.
In addition, recent versions of Intel CPUs include optimisations for ML inference thanks to the newly released Intel Deep Learning Boost [4].
To understand the impact of this new instruction set, we run a new set of benchmarks using the Compute Optimized machines on GCP running the new generation of Intel CPUs:
* for this benchmark the number of concurrent users was equal to the number of vCPUs
Learnings
By optimizing the hardware we use to run our ML inference server, we can increase throughput by 300% and decrease latency by 30%
Our baseline inference server could make up to 6 predictions per second with each prediction taking around 320 milliseconds.
By optimising how we made the predictions, utilizing quantization and distillation as well as the hardware used, we created an inference service that could make up to 68 predictions per second with each prediction taking about 60 milliseconds!
By optimizing our Python inference service, we have increased throughput by a factor of more than 10 (to about 70 requests per second) and divided latency by 5 (to 60 milliseconds)!
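The improvement factors can be checked directly from the numbers quoted above (6 and 68 predictions per second, 320 and 60 milliseconds). The variable names in this sketch are illustrative only.

```python
# Figures quoted in the post for the baseline and the fully optimized service.
baseline_rps, optimized_rps = 6, 68
baseline_latency_ms, optimized_latency_ms = 320, 60

throughput_factor = optimized_rps / baseline_rps
latency_factor = baseline_latency_ms / optimized_latency_ms

print(f"Throughput multiplied by {throughput_factor:.1f}")
print(f"Latency divided by {latency_factor:.1f}")
```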
If you would like to optimise your serving framework further, check out the series that Hugging Face have released: Scaling up BERT-like model Inference on modern CPU