ML Model Serving API
Production-ready API for serving machine learning models with auto-scaling.
Tech Stack
FastAPI, PyTorch, TensorFlow, ONNX, Kubernetes, Redis, PostgreSQL, Prometheus, Grafana
Overview
Started this because deploying ML models at my internship was a mess - data scientists would hand over a Jupyter notebook and expect it to "just work" in production. Built a service that takes any PyTorch or TensorFlow model, wraps it in a FastAPI endpoint, containerizes it, and deploys to Kubernetes with auto-scaling. Data scientists push to a git repo, CI/CD handles the rest. Basically MLOps-in-a-box for small teams who don't want to hire a platform engineer.
The Problem
The gap between training a model and serving it in production is huge. Data scientists aren't DevOps experts, and shouldn't need to be. At the startup I interned at, models sat in notebooks for months because nobody knew how to deploy them properly. When we did deploy, it was janky Flask apps running on a single EC2 instance with no monitoring, scaling, or versioning. One model going down took everything else with it.
The Solution
Drop your model file and a requirements.txt into a specific repo structure and push to GitHub. My CI pipeline picks it up, validates it, wraps it in a FastAPI server with automatic input validation and output serialization, containerizes it, and deploys it to Kubernetes with HPA (horizontal pod autoscaling). Health checks, metrics, logging - all automatic. Old versions stay up for rollback, and A/B testing is just splitting traffic between deployments. Took our average deployment time from about two weeks to under half an hour.
Technical Details
- FastAPI for the serving layer - way faster than Flask, native async support, automatic OpenAPI docs (a minimal sketch of the generated wrapper follows this list)
- Dynamic model loading - the server detects the model format (PyTorch .pth, TensorFlow SavedModel, ONNX, etc.) and loads the matching inference engine (loader sketch below)
- Request batching with a configurable window (default 100ms) to improve throughput - 50 requests can go through one forward pass instead of 50 serial calls (batcher sketch below)
- Model warmup on startup - runs dummy inferences to preload weights and compile operations, so the first real request is fast instead of having to wait until the 10th
- Kubernetes HPA based on custom metrics (not just CPU) - scales on request queue depth and inference latency
- Redis for caching predictions from deterministic models (cut our compute costs by about 30%; cache sketch below)
- Prometheus + Grafana for monitoring - inference time, batch sizes, cache hit rates, and error rates per model (metrics sketch below)
- Canary deployments built in - a new model version gets 5% of traffic, which gradually increases if error rates look good
- Model registry (just a Postgres table right now, might switch to MLflow) tracks versions, metrics, and deployment history
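A minimal sketch of what the generated FastAPI wrapper looks like. The endpoint names, request schema, and the `DummyModel` stand-in are illustrative assumptions, not the actual generated code; the real wrapper loads a PyTorch/TensorFlow/ONNX model instead.

```python
from typing import List

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="model-server")


class PredictRequest(BaseModel):
    instances: List[List[float]]  # batch of feature vectors


class PredictResponse(BaseModel):
    predictions: List[float]


class DummyModel:
    """Stand-in for the real PyTorch/TensorFlow/ONNX wrapper."""
    n_features = 4

    def predict(self, instances):
        return [float(sum(row)) for row in instances]


model = None  # loaded once per container, reused across requests


@app.on_event("startup")
def load_and_warm():
    global model
    model = DummyModel()
    model.predict([[0.0] * model.n_features])  # warmup: one dummy inference


@app.get("/healthz")
def healthz():
    return {"status": "ok"}


@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    return PredictResponse(predictions=model.predict(req.instances))
```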
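The dynamic loader dispatches on whatever artifact it finds in the model directory, roughly like the sketch below. This is a simplification; in particular, the assumption that PyTorch models arrive as TorchScript exports is mine.

```python
from pathlib import Path


def load_model(model_dir: str):
    """Pick an inference engine based on the artifact found in model_dir."""
    path = Path(model_dir)

    onnx_files = list(path.glob("*.onnx"))
    if onnx_files:
        import onnxruntime as ort
        return ort.InferenceSession(str(onnx_files[0]))

    torch_files = list(path.glob("*.pt")) + list(path.glob("*.pth"))
    if torch_files:
        import torch
        # Assumes the data scientist exported TorchScript, not a bare state_dict.
        return torch.jit.load(str(torch_files[0]))

    if (path / "saved_model.pb").exists():
        import tensorflow as tf
        return tf.saved_model.load(str(path))

    raise ValueError(f"No supported model artifact found in {model_dir}")
```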
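Request batching works roughly like this sketch: requests accumulate for at most the configured window (or until the batch fills), then go through the model in one forward pass. `run_batch` is a placeholder for the real inference call, and the queueing logic is my simplification of the idea.

```python
import asyncio


class Batcher:
    def __init__(self, run_batch, window_ms: float = 100, max_batch: int = 64):
        self.run_batch = run_batch   # callable: list of inputs -> list of outputs
        self.window = window_ms / 1000.0
        self.max_batch = max_batch
        self.pending = []            # (input, future) pairs waiting for a flush
        self.lock = asyncio.Lock()
        self.flusher = None          # timer task for the current window

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        async with self.lock:
            self.pending.append((item, fut))
            if len(self.pending) >= self.max_batch:
                self._flush()  # batch is full, run it now
            elif self.flusher is None:
                self.flusher = asyncio.create_task(self._flush_after_window())
        return await fut

    async def _flush_after_window(self):
        await asyncio.sleep(self.window)
        async with self.lock:
            self._flush()

    def _flush(self):
        # Caller must hold self.lock. Kept simple here: the forward pass runs
        # inline; a production version would offload it to an executor.
        batch, self.pending = self.pending, []
        self.flusher = None
        if not batch:
            return
        outputs = self.run_batch([item for item, _ in batch])  # one forward pass
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)
```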
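The prediction cache keys on the model name, version, and a hash of the input, so a redeployed model never serves stale results. The key format, TTL, and `redis` client setup below are assumptions for the sketch.

```python
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379)


def cached_predict(model_name, version, features, predict_fn, ttl_seconds=3600):
    """Only safe for deterministic models: same input always -> same output."""
    payload = json.dumps(features, sort_keys=True).encode()
    key = f"pred:{model_name}:{version}:{hashlib.sha256(payload).hexdigest()}"

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    result = predict_fn(features)
    cache.set(key, json.dumps(result), ex=ttl_seconds)
    return result
```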
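Metrics end up in Prometheus; assuming the standard prometheus_client library, the per-model instrumentation looks something like this (metric and label names are illustrative).

```python
from prometheus_client import Counter, Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "model_inference_seconds", "Time spent in one forward pass", ["model", "version"]
)
PREDICTIONS = Counter(
    "model_predictions_total", "Predictions served", ["model", "version"]
)
CACHE_HITS = Counter(
    "model_cache_hits_total", "Prediction cache hits", ["model", "version"]
)

start_http_server(9100)  # Prometheus scrapes /metrics on this port

# Inside the request handler:
with INFERENCE_LATENCY.labels("fraud-detection", "v3").time():
    pass  # forward pass goes here
PREDICTIONS.labels("fraud-detection", "v3").inc()
```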
Challenges
- Cold start times were brutal - 30+ seconds for large models, and keeping every version running all the time costs too much. Implemented a predictive pre-warming system that spins up pods 30s before anticipated traffic based on historical patterns. Still working on this.
- Different models need wildly different resources - a small NLP model needs 2GB RAM, a large vision model needs 16GB plus a GPU. Built a tagging system where data scientists specify resource requirements, but they always underestimate, so I added an auto-profiling pass that runs the model on sample data and measures actual usage (sketched after this list).
- Version conflicts in the same container were a nightmare (model A needs TensorFlow 2.10, model B needs 2.13). Ended up giving each model its own container with its own dependencies - more disk space, but way less debugging.
- GPU sharing is hard - you can't just run 4 models on one GPU without hitting memory fragmentation issues. Using NVIDIA MPS now for better multi-tenant GPU usage, but it's still finicky.
- Monitoring inference accuracy in production is tricky because you don't always have ground-truth labels. Added a drift detection system that compares input distributions (sketched below) - more of an early warning than a solution.
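The auto-profiling pass boils down to running the model on sample batches and recording peak memory and latency, which then feeds the container's resource request. The `psutil` usage and the 1.5x headroom factor are my assumptions; GPU models would need something like torch.cuda.max_memory_allocated() instead of RSS.

```python
import time

import psutil


def profile_model(predict_fn, sample_batches, headroom=1.5):
    """Measure peak RSS and latency while running representative inputs."""
    proc = psutil.Process()
    peak_rss = proc.memory_info().rss
    latencies = []

    for batch in sample_batches:
        start = time.perf_counter()
        predict_fn(batch)
        latencies.append(time.perf_counter() - start)
        peak_rss = max(peak_rss, proc.memory_info().rss)

    latencies.sort()
    return {
        "memory_request_bytes": int(peak_rss * headroom),  # pad for safety
        "p50_latency_s": latencies[len(latencies) // 2],
        "p95_latency_s": latencies[int(len(latencies) * 0.95)],
    }
```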
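For drift detection, one simple way to compare input distributions is a per-feature two-sample Kolmogorov-Smirnov test between the training data and a recent window of production inputs; the test choice and threshold below are my assumptions, not necessarily what the service uses.

```python
import numpy as np
from scipy.stats import ks_2samp


def drift_report(train_inputs: np.ndarray, recent_inputs: np.ndarray,
                 p_threshold: float = 0.01):
    """Flag features whose recent distribution differs from the training data."""
    drifted = []
    for i in range(train_inputs.shape[1]):
        stat, p_value = ks_2samp(train_inputs[:, i], recent_inputs[:, i])
        if p_value < p_threshold:
            drifted.append({"feature": i, "ks_stat": stat, "p_value": p_value})
    return drifted
```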
Results
- Deployed 23 models to production (a mix of internal tools and customer-facing features)
- Reduced average deployment time from 12 days to 25 minutes
- Serving ~2M predictions per day across all models
- 99.7% uptime over the last 6 months (two outages: one from AWS, one from me pushing a bad config)
- Auto-scaling has saved roughly $800/month in compute compared to fixed capacity
- One model (fraud detection) prevented $45K in losses last quarter, according to the finance team. Not bad.
- Data scientists are actually using it, which is the real success metric