Designing a Multi-Agent System for Engineering Support at Scale: A Case Study From Grab
Grab’s Analytics Data Warehouse (ADW) team has introduced a multi-agent AI system to automate engineering support workflows across its large-scale data platform, aiming to reduce repetitive operational work and improve resolution efficiency. The system is designed to handle internal engineering requests spanning data warehouse troubleshooting, SQL debugging, and platform support, while shifting engineers toward higher-value development work.
The ADW platform supports more than 1,000 internal users and manages over 15,000 tables, serving as a core analytics infrastructure component within Grab. As usage grew, the engineering team observed that a significant portion of operational effort was being consumed by repetitive support tasks and ad hoc investigations, limiting time available for platform improvement and system design work.
In a LinkedIn post, Sneh Agrawa, Head of Analytics at Grab, highlighted that Grab’s Central Data Team is leveraging a multi-agent system to automate repetitive operational work, reclaiming hundreds of engineering hours each month. According to the post, this shift is unlocking critical engineering bandwidth and enabling a transition from reactive firefighting to higher-value system building.
To address the growing support load, the team implemented a multi-agent architecture that separates incoming engineering requests into two primary workflows: investigation and enhancement. Investigation workflows are designed for diagnostic tasks such as query analysis, log retrieval, schema lookup, and issue summarization. Enhancement workflows focus on generating actionable outputs, including code changes, SQL fixes, and automated merge requests for review.
Multi-agent architecture tech stack (Source: Grab Tech Blog Post)
The system is orchestrated using a LangGraph-based workflow engine combined with FastAPI services that coordinate routing, tool execution, and state management across agents. Requests are first classified and then routed to specialized agents responsible for tasks such as context retrieval, code search, or solution generation. Each agent operates with constrained responsibilities to reduce ambiguity and improve the predictability of outputs.
Agent workflows, using a Supervisor that controls communication flow and task delegation (Source: Grab Tech Blog Post)
According to Grab engineers, “The separation of investigation and enhancement paths helped us reduce complexity in agent reasoning and improved reliability in production workflows.”
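The classify-then-route flow described above can be illustrated with a minimal sketch in plain Python. This is not Grab's implementation and does not use the LangGraph API; the names `RequestState`, `classify`, `supervisor`, and the keyword hints are all hypothetical, and a production classifier would more likely be an LLM call than a keyword match.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RequestState:
    """Shared state handed from one agent to the next (hypothetical shape)."""
    text: str
    workflow: str = ""
    trace: List[str] = field(default_factory=list)


# Assumption: simple keyword hints stand in for a real request classifier.
ENHANCEMENT_HINTS = ("fix", "add column", "change", "update", "merge request")


def classify(state: RequestState) -> RequestState:
    """Label the request as 'investigation' or 'enhancement'."""
    lowered = state.text.lower()
    is_enhancement = any(hint in lowered for hint in ENHANCEMENT_HINTS)
    state.workflow = "enhancement" if is_enhancement else "investigation"
    state.trace.append(f"classified:{state.workflow}")
    return state


def investigation_agent(state: RequestState) -> RequestState:
    """Placeholder for diagnostic work: query analysis, log retrieval, etc."""
    state.trace.append("investigation_agent")
    return state


def enhancement_agent(state: RequestState) -> RequestState:
    """Placeholder for producing a code or SQL change for human review."""
    state.trace.append("enhancement_agent")
    return state


# Each workflow maps to exactly one narrowly scoped agent.
AGENTS: Dict[str, Callable[[RequestState], RequestState]] = {
    "investigation": investigation_agent,
    "enhancement": enhancement_agent,
}


def supervisor(text: str) -> RequestState:
    """Classify the request first, then delegate to the matching agent."""
    state = classify(RequestState(text=text))
    return AGENTS[state.workflow](state)
```

Keeping the routing table explicit means each agent only ever sees requests of one kind, which is one way to read the "constrained responsibilities" the article attributes to the design.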
A key architectural decision was the consolidation of the tool ecosystem. The system initially exposed more than 30 internal tools across data access, logging, and code systems. This was later reduced to a smaller, curated toolset to improve maintainability and reduce unpredictable tool selection by agents. The tool layer includes controlled SQL execution, metadata access, log retrieval systems, and integration with Git-based workflows for change management.
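One way to picture a curated toolset is an allowlist-backed registry: agents can only call tools that were explicitly approved and registered. The sketch below is an illustration under that assumption, not the structure Grab uses; `ToolRegistry` and the tool names are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List


class ToolRegistry:
    """Exposes only an explicitly approved set of tools to agents."""

    def __init__(self, allowed_names: Iterable[str]) -> None:
        self._allowed = frozenset(allowed_names)
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, func: Callable[..., Any]) -> None:
        """Add a tool, rejecting anything outside the curated allowlist."""
        if name not in self._allowed:
            raise ValueError(f"tool {name!r} is not in the curated toolset")
        self._tools[name] = func

    def names(self) -> List[str]:
        """Tool names an agent is allowed to choose from."""
        return sorted(self._tools)

    def call(self, name: str, **kwargs: Any) -> Any:
        """Invoke a registered tool by name."""
        if name not in self._tools:
            raise KeyError(f"unknown tool {name!r}")
        return self._tools[name](**kwargs)


# Hypothetical tool names loosely matching the categories in the article.
registry = ToolRegistry({"run_sql", "get_table_metadata", "fetch_logs"})
registry.register("get_table_metadata", lambda table: {"table": table})
```

A smaller menu of tools gives the model fewer near-duplicate options to choose between, which is the maintainability and predictability argument made above.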
Safety and governance were integrated into the system design. SQL execution is constrained through validation layers, and sensitive data handling includes mechanisms for detecting and mitigating exposure risks. In addition, all enhancement workflows that produce code changes require human-in-the-loop review before deployment, ensuring that automated outputs remain subject to engineering oversight.
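The article does not describe how the validation layers or the review gate are built. As a rough sketch under stated assumptions, a read-only SQL check and an approval gate might look like the following; `validate_sql`, `ProposedChange`, and `deploy` are hypothetical names, and the keyword check is deliberately conservative (it would also reject a harmless query that merely mentions a word such as "update"), where a real system would more likely use a SQL parser.

```python
import re
from dataclasses import dataclass

# Assumption: any write or DDL keyword anywhere in the statement is rejected.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant|merge)\b",
    re.IGNORECASE,
)


def validate_sql(sql: str) -> None:
    """Raise ValueError unless the input is a single read-only statement."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"(?i)^(select|with)\b", statement):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("statement contains a write or DDL keyword")


@dataclass
class ProposedChange:
    """A code or SQL change produced by an enhancement workflow."""
    description: str
    approved_by: str = ""  # filled in by a human reviewer


def deploy(change: ProposedChange) -> str:
    """Refuse to deploy anything that has not been approved by a person."""
    if not change.approved_by:
        raise PermissionError("human review required before deployment")
    return f"deployed: {change.description} (approved by {change.approved_by})"
```

Both checks fail closed: a query or change that cannot be shown to be safe is rejected rather than passed through.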
Context management emerged as a significant technical challenge, as agents needed to understand the context of each request to provide accurate and relevant responses. To address this, the team developed a context management system that uses natural language processing (NLP) to analyze the request and identify the relevant context.
The multi-agent system has been deployed in production and has shown significant improvements in resolution efficiency and reduction of repetitive operational work. The team plans to continue iterating and improving the system to further enhance its capabilities and scalability.