Scaling a complex system of machine learning models while delivering real-time insights is no small feat. ZestyAI’s engineering team reimagined its architecture to overcome these challenges, leveraging NVIDIA’s Triton Inference Server and introducing the “Monster Pod.” This transformation halved API response times, increased throughput by 10x, and cut cloud costs by 75%. Dive into how strategic experimentation and innovative design unlocked efficiency and positioned ZestyAI for future growth.
By Andrew Merski, VP, Engineering
The Challenge: A Complex System at Growing Scale
Business Context
At ZestyAI, we deliver critical insights to insurance clients using machine learning models. Our API processes significant volumes of imagery, geolocation, and structured data to produce real-time results. The complexity of each request places immense demands on our infrastructure:
- Synchronous API Calls: Each request must be processed in real-time, with all insights delivered back to the client in a single response. Low latency is non-negotiable, as our clients’ workflows rely on immediate feedback.
- Multiple ML Models Per Request: Each request may invoke up to 30 ML models, ranging from computer vision models analyzing aerial imagery to models synthesizing geospatial and tabular data.
- Growing Model Catalog: The catalog of ML models we deploy continues to expand, driven by both customer needs and internal innovation. Each new model adds additional complexity to the system.
- Exceptional Reliability: Our clients in the insurance sector demand a system that operates flawlessly, with uptime and accuracy critical to their decision-making processes.
Previous Architecture: A Decentralized Model
In our previous system, each ML model operated as an independent microservice. Each model scaled independently, and each instance required its own GPU. While functional, this architecture introduced critical issues:
- Resource Underutilization: GPUs sat idle for much of each request while non-GPU work consumed significant time.
- Scaling Challenges: Periods of high API traffic put additional strain on system components and compounded these inefficiencies.
- Capacity Limitations: Architectural constraints capped how far the system could scale, threatening future growth.
This architecture also resulted in significant operational complexity. Each model’s independent deployment meant substantial manual effort in testing, scaling, and troubleshooting. Cloud costs also escalated rapidly as new models were added, creating diminishing returns for each improvement in service quality.
The Solution: A Centralized Architecture with Triton
Faced with scaling challenges and rising customer demand, we reimagined the entire architecture. At the heart of the solution was NVIDIA’s Triton Inference Server, a tool designed for efficient multi-model serving.
Why Triton?
Triton enabled:
- Shared GPU resources across models.
- Ensemble models to define workflows using configuration rather than code.
- Extensive benchmarking tools for performance optimization.
- Support for various backends, including Python and PyTorch.
However, Triton required a significant investment in customization to meet our needs: its low-level interface and lack of native autoscaling demanded a tailored implementation.
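To make the ensemble idea concrete, here is a minimal client-side sketch using Triton’s Python HTTP client (tritonclient). The model name property_risk_ensemble and the IMAGE/RISK_SCORE tensor names are hypothetical placeholders, not ZestyAI’s actual models.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a local Triton instance (default HTTP port).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder input tensor; a real request would carry preprocessed imagery.
image = np.zeros((1, 3, 512, 512), dtype=np.float32)
inp = httpclient.InferInput("IMAGE", list(image.shape), "FP32")
inp.set_data_from_numpy(image)

# Ask only for the final ensemble output.
out = httpclient.InferRequestedOutput("RISK_SCORE")

# One call: Triton's ensemble configuration routes tensors through the
# underlying models and returns the aggregated result.
result = client.infer(model_name="property_risk_ensemble", inputs=[inp], outputs=[out])
print(result.as_numpy("RISK_SCORE"))
```

From the caller’s perspective, the fan-out across individual models is invisible; the ensemble definition in Triton’s model configuration, not application code, decides how tensors flow between them.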
New Architecture: The Monster Pod
To maximize Triton’s potential, we introduced the “Monster Pod,” consolidating all models and supporting microservices into a single Kubernetes pod. Key features included:
- Single-host model serving: All models resided in a unified Triton instance.
- Integrated workflow management: The workflow orchestrator and other microservices were co-located with Triton.
- Streamlined scaling: Each pod functioned as an independent unit, simplifying horizontal scaling.
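As a rough illustration of that consolidation, the sketch below assembles a single pod spec containing Triton plus a co-located workflow orchestrator, using the official Kubernetes Python client. The container images, the single-GPU resource limit, and the orchestrator name are illustrative assumptions, not ZestyAI’s actual manifests.

```python
from kubernetes import client

# One Triton container serving the entire model repository.
triton = client.V1Container(
    name="triton",
    image="nvcr.io/nvidia/tritonserver:24.08-py3",  # assumed release tag
    args=["tritonserver", "--model-repository=/models"],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),  # assumed GPU count
)

# Hypothetical workflow orchestrator, co-located in the same pod so that
# inter-service traffic stays on localhost.
orchestrator = client.V1Container(
    name="workflow-orchestrator",
    image="example.com/workflow-orchestrator:latest",  # placeholder image
)

monster_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="monster-pod", labels={"app": "inference"}),
    spec=client.V1PodSpec(containers=[triton, orchestrator]),
)
```

In practice, a Deployment or similar controller would wrap this spec so that each replica lands on its own GPU node and the whole unit scales horizontally.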
This “Monster Pod” approach offered numerous benefits:
Improved Resource Utilization
- Maximized GPU usage by serving multiple models per instance.
- Reduced the overhead associated with multiple nodes and microservices.
Simplified Testing and Benchmarking
- Each pod contained all necessary components, enabling comprehensive testing in isolation.
- Benchmarking provided clear insights into throughput and resource requirements.
Reduced Scaling Overhead
- Eliminated dependency on Istio for internal traffic management.
- Simplified node provisioning and scheduling.
Predictable Costs
- Each pod corresponds to a fixed node cost, allowing accurate cost planning.
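Because each pod maps to a fixed node cost, capacity planning reduces to simple arithmetic. The numbers below are made up purely to illustrate the calculation; they are not ZestyAI’s actual throughput or pricing figures.

```python
import math

peak_requests_per_second = 40    # assumed peak load
per_pod_throughput_rps = 5       # assumed, as measured by per-pod benchmarking
node_cost_per_hour_usd = 3.00    # assumed GPU node price

# Each pod is a fixed-size unit on its own node, so cost scales linearly
# and predictably with the pod count needed for peak traffic.
pods_needed = math.ceil(peak_requests_per_second / per_pod_throughput_rps)
monthly_cost_usd = pods_needed * node_cost_per_hour_usd * 24 * 30

print(f"{pods_needed} pods -> ~${monthly_cost_usd:,.0f} per month")
```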
Lessons Learned
This project revealed critical insights that extend beyond Triton or even ML systems:
1. The “Microservices vs. Monolith” Debate Isn’t Binary
Architectural decisions don’t have to be all-or-nothing. For instance, while our deployment consolidated models into a single pod, we retained microservices for other aspects of the platform. Evaluating “single vs. many” decisions at multiple levels allowed us to optimize each layer independently.
2. Understand the Bottlenecks Before Designing Solutions
Identifying the root causes of inefficiency—scaling overhead, resource underutilization, network traffic—helped us design a system that addressed these challenges holistically rather than incrementally.
3. The Power of Consolidation
Integrating multiple components into a single deployment reduced complexity, improved performance, and simplified scaling. This approach may not suit every scenario, but in our case, it delivered transformative results.
4. Be Open to Temporary Solutions (Flexibility Leads to Innovation)
The “Monster Pod” started as a quick workaround but became a permanent fixture due to its outsized impact. Being open to experimentation unlocked unexpected benefits, such as easier resource planning and reduced operational complexity.
Business Impact
Rebuilding our ML inference platform was a bold move that paid off. The new architecture produced dramatic improvements across key metrics:
- Latency: API response times were halved.
- Capacity: System throughput increased by 10x, eliminating the previous capacity ceiling.
- Cost Efficiency: Cloud costs for model serving dropped by 75%.
These gains position us to scale with growing demand while maintaining industry-leading performance. Additionally, the simplified architecture has freed up engineering resources to focus on innovation rather than maintenance.
While Triton Inference Server played a critical role, the real success lay in our architectural decisions and willingness to rethink the status quo. This project underscores the value of experimentation and the importance of tailoring solutions to meet unique challenges.
The lessons learned from this journey will continue to inform our approach to system design and scalability as we look ahead. The Monster Pod has not only transformed our current capabilities but has also set the stage for future growth and innovation.
For a deeper dive into the technical details, check out Andrew Merski’s original blog on Medium.