ML Model Serving API
Production-ready API for serving machine learning models with auto-scaling.
Tech Stack
FastAPI, PyTorch, TensorFlow, ONNX, Kubernetes, Redis, PostgreSQL, Prometheus, Grafana
Overview
Started this because deploying ML models at my internship was a mess - data scientists would hand over a Jupyter notebook and expect it to "just work" in production. Built a service that takes any PyTorch or TensorFlow model, wraps it in a FastAPI endpoint, containerizes it, and deploys to Kubernetes with auto-scaling. Data scientists push to a git repo, CI/CD handles the rest. Basically MLOps-in-a-box for small teams who don't want to hire a platform engineer.
The Problem
The gap between training a model and serving it in production is huge. Data scientists aren't DevOps experts, and shouldn't need to be. At the startup I interned at, models sat in notebooks for months because nobody knew how to deploy them properly. When we did deploy, it was janky Flask apps running on a single EC2 instance with no monitoring, scaling, or versioning. One model going down took everything else with it.
The Solution
Drop your model file and a requirements.txt into a specific repo structure and push to GitHub. My CI pipeline picks it up, validates it, wraps it in a FastAPI server with automatic input validation and output serialization, containerizes it, and deploys it to Kubernetes with HPA (horizontal pod autoscaling). Health checks, metrics, logging - all automatic. Old versions stay up for rollback, and A/B testing is just splitting traffic between deployments. Took our average deployment time from about two weeks to under half an hour.
Technical Details
- FastAPI for the serving layer - way faster than Flask, native async support, automatic OpenAPI docs (a minimal sketch of the generated wrapper follows this list)
- Dynamic model loading - the server detects the model format (PyTorch .pth, TensorFlow SavedModel, ONNX, etc.) and loads the matching inference engine (loader sketch below)
- Request batching with a configurable window (default 100ms) to improve throughput - 50 requests can go through one forward pass instead of 50 serial calls (batcher sketch below)
- Model warmup on startup - runs dummy inferences to preload weights and compile operations, so the first real request is fast instead of having to wait until the 10th
- Kubernetes HPA based on custom metrics (not just CPU) - scales on request queue depth and inference latency
- Redis for caching predictions from deterministic models (cut our compute costs by about 30%; cache sketch below)
- Prometheus + Grafana for monitoring - inference time, batch sizes, cache hit rates, and error rates per model (metrics sketch below)
- Canary deployments built in - a new model version gets 5% of traffic, which gradually increases if error rates look good
- Model registry (just a Postgres table right now, might switch to MLflow) tracks versions, metrics, and deployment history
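A minimal sketch of what the generated FastAPI wrapper looks like. The endpoint names, request schema, and the `DummyModel` stand-in are illustrative assumptions, not the actual generated code; the real wrapper loads a PyTorch/TensorFlow/ONNX model instead.

```python
from typing import List

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="model-server")


class PredictRequest(BaseModel):
    instances: List[List[float]]  # batch of feature vectors


class PredictResponse(BaseModel):
    predictions: List[float]


class DummyModel:
    """Stand-in for the real PyTorch/TensorFlow/ONNX wrapper."""
    n_features = 4

    def predict(self, instances):
        return [float(sum(row)) for row in instances]


model = None  # loaded once per container, reused across requests


@app.on_event("startup")
def load_and_warm():
    global model
    model = DummyModel()
    model.predict([[0.0] * model.n_features])  # warmup: one dummy inference


@app.get("/healthz")
def healthz():
    return {"status": "ok"}


@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    return PredictResponse(predictions=model.predict(req.instances))
```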
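The dynamic loader dispatches on whatever artifact it finds in the model directory, roughly like the sketch below. This is a simplification; in particular, the assumption that PyTorch models arrive as TorchScript exports is mine.

```python
from pathlib import Path


def load_model(model_dir: str):
    """Pick an inference engine based on the artifact found in model_dir."""
    path = Path(model_dir)

    onnx_files = list(path.glob("*.onnx"))
    if onnx_files:
        import onnxruntime as ort
        return ort.InferenceSession(str(onnx_files[0]))

    torch_files = list(path.glob("*.pt")) + list(path.glob("*.pth"))
    if torch_files:
        import torch
        # Assumes the data scientist exported TorchScript, not a bare state_dict.
        return torch.jit.load(str(torch_files[0]))

    if (path / "saved_model.pb").exists():
        import tensorflow as tf
        return tf.saved_model.load(str(path))

    raise ValueError(f"No supported model artifact found in {model_dir}")
```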
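Request batching works roughly like this sketch: requests accumulate for at most the configured window (or until the batch fills), then go through the model in one forward pass. `run_batch` is a placeholder for the real inference call, and the queueing logic is my simplification of the idea.

```python
import asyncio


class Batcher:
    def __init__(self, run_batch, window_ms: float = 100, max_batch: int = 64):
        self.run_batch = run_batch   # callable: list of inputs -> list of outputs
        self.window = window_ms / 1000.0
        self.max_batch = max_batch
        self.pending = []            # (input, future) pairs waiting for a flush
        self.lock = asyncio.Lock()
        self.flusher = None          # timer task for the current window

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        async with self.lock:
            self.pending.append((item, fut))
            if len(self.pending) >= self.max_batch:
                self._flush()  # batch is full, run it now
            elif self.flusher is None:
                self.flusher = asyncio.create_task(self._flush_after_window())
        return await fut

    async def _flush_after_window(self):
        await asyncio.sleep(self.window)
        async with self.lock:
            self._flush()

    def _flush(self):
        # Caller must hold self.lock. Kept simple here: the forward pass runs
        # inline; a production version would offload it to an executor.
        batch, self.pending = self.pending, []
        self.flusher = None
        if not batch:
            return
        outputs = self.run_batch([item for item, _ in batch])  # one forward pass
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)
```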
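The prediction cache keys on the model name, version, and a hash of the input, so a redeployed model never serves stale results. The key format, TTL, and `redis` client setup below are assumptions for the sketch.

```python
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379)


def cached_predict(model_name, version, features, predict_fn, ttl_seconds=3600):
    """Only safe for deterministic models: same input always -> same output."""
    payload = json.dumps(features, sort_keys=True).encode()
    key = f"pred:{model_name}:{version}:{hashlib.sha256(payload).hexdigest()}"

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    result = predict_fn(features)
    cache.set(key, json.dumps(result), ex=ttl_seconds)
    return result
```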
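Metrics end up in Prometheus; assuming the standard prometheus_client library, the per-model instrumentation looks something like this (metric and label names are illustrative).

```python
from prometheus_client import Counter, Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "model_inference_seconds", "Time spent in one forward pass", ["model", "version"]
)
PREDICTIONS = Counter(
    "model_predictions_total", "Predictions served", ["model", "version"]
)
CACHE_HITS = Counter(
    "model_cache_hits_total", "Prediction cache hits", ["model", "version"]
)

start_http_server(9100)  # Prometheus scrapes /metrics on this port

# Inside the request handler:
with INFERENCE_LATENCY.labels("fraud-detection", "v3").time():
    pass  # forward pass goes here
PREDICTIONS.labels("fraud-detection", "v3").inc()
```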
Challenges
- Cold start times were brutal - 30+ seconds for large models, and keeping every version running all the time costs too much. Implemented a predictive pre-warming system that spins up pods 30s before anticipated traffic based on historical patterns. Still working on this.
- Different models need wildly different resources - a small NLP model needs 2GB RAM, a large vision model needs 16GB plus a GPU. Built a tagging system where data scientists specify resource requirements, but they always underestimate, so I added an auto-profiling pass that runs the model on sample data and measures actual usage (sketched after this list).
- Version conflicts in the same container were a nightmare (model A needs TensorFlow 2.10, model B needs 2.13). Ended up giving each model its own container with its own dependencies - more disk space, but way less debugging.
- GPU sharing is hard - you can't just run 4 models on one GPU without hitting memory fragmentation issues. Using NVIDIA MPS now for better multi-tenant GPU usage, but it's still finicky.
- Monitoring inference accuracy in production is tricky because you don't always have ground-truth labels. Added a drift detection system that compares input distributions (sketched below) - more of an early warning than a solution.
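The auto-profiling pass boils down to running the model on sample batches and recording peak memory and latency, which then feeds the container's resource request. The `psutil` usage and the 1.5x headroom factor are my assumptions; GPU models would need something like torch.cuda.max_memory_allocated() instead of RSS.

```python
import time

import psutil


def profile_model(predict_fn, sample_batches, headroom=1.5):
    """Measure peak RSS and latency while running representative inputs."""
    proc = psutil.Process()
    peak_rss = proc.memory_info().rss
    latencies = []

    for batch in sample_batches:
        start = time.perf_counter()
        predict_fn(batch)
        latencies.append(time.perf_counter() - start)
        peak_rss = max(peak_rss, proc.memory_info().rss)

    latencies.sort()
    return {
        "memory_request_bytes": int(peak_rss * headroom),  # pad for safety
        "p50_latency_s": latencies[len(latencies) // 2],
        "p95_latency_s": latencies[int(len(latencies) * 0.95)],
    }
```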
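For drift detection, one simple way to compare input distributions is a per-feature two-sample Kolmogorov-Smirnov test between the training data and a recent window of production inputs; the test choice and threshold below are my assumptions, not necessarily what the service uses.

```python
import numpy as np
from scipy.stats import ks_2samp


def drift_report(train_inputs: np.ndarray, recent_inputs: np.ndarray,
                 p_threshold: float = 0.01):
    """Flag features whose recent distribution differs from the training data."""
    drifted = []
    for i in range(train_inputs.shape[1]):
        stat, p_value = ks_2samp(train_inputs[:, i], recent_inputs[:, i])
        if p_value < p_threshold:
            drifted.append({"feature": i, "ks_stat": stat, "p_value": p_value})
    return drifted
```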
Results
- Deployed 23 models to production (a mix of internal tools and customer-facing features)
- Reduced average deployment time from 12 days to 25 minutes
- Serving ~2M predictions per day across all models
- 99.7% uptime over the last 6 months (two outages: one from AWS, one from me pushing a bad config)
- Auto-scaling has saved roughly $800/month in compute compared to fixed capacity
- One model (fraud detection) prevented $45K in losses last quarter, according to the finance team. Not bad.
- Data scientists are actually using it, which is the real success metric