From Infrastructure to Intelligence (Part 1): Strategic Foundations for AI Model Hosting and Agent-Based Architectures on the Cloud
What began as simple chatbot prototypes has evolved into full-fledged agent architectures.

1. Introduction: Why AI Infrastructure Strategy Matters Now
The rapid evolution of LLMs and agent-based systems
The past year has seen a rapid acceleration in large language models (LLMs) and AI agent technologies, transforming how businesses build intelligent systems. What began as simple chatbot prototypes has evolved into full-fledged agent ecosystems capable of reasoning, decision-making, and tool integration. As these systems mature, companies are no longer satisfied with relying solely on third-party APIs — they want control, customization, and cost-efficiency. Hosting your own AI models on the cloud unlocks strategic advantages: performance tuning, data security, and full ownership of the user experience.
Why hosting your own models and designing intelligent agents is a key differentiator
At the same time, AI agents are becoming the new abstraction layer for intelligence, allowing teams to build dynamic workflows that go far beyond static prompt engineering. These agents act like digital co-workers — orchestrating tools, retrieving knowledge, and adapting to user goals in real time. When model infrastructure and agent design are strategically aligned, companies can deliver smarter, faster, and more scalable AI products. This blog post explores the key components of that strategy: how to host AI models on the cloud and how to design agent workflows that truly unlock their potential. Whether you’re a tech leader planning your AI roadmap or a developer architecting the next breakthrough system, this guide will help you navigate the journey from infrastructure to intelligence.
2. Cloud-Hosted AI Models: Enabling Scalable, Custom Intelligence
The shift from API-only to hybrid/self-hosted model strategies
As AI adoption deepens, organizations are shifting from an API-only mindset to hybrid or fully self-hosted model strategies that offer greater flexibility and control. While APIs from providers like OpenAI and Anthropic offer convenience and speed to market, they come with limitations around cost, latency, customization, and data handling. Cloud-hosted models — whether deployed via managed services or open-source stacks — give teams the power to shape their AI infrastructure around their unique needs.
Unlocking the Benefits of NIM (NVIDIA Inference Microservice)
NVIDIA NIM offers a streamlined way to deploy high-performance AI models as containerized microservices directly on leading cloud platforms. By leveraging NIM, organizations get optimized GPU acceleration out of the box, reducing inference latency and improving throughput without deep infrastructure tuning. It simplifies the deployment of popular models — including LLMs, vision models, and custom fine-tuned models — using standard APIs and containers. On cloud service providers (CSPs) like AWS, Azure, and GCP, NIM integrates seamlessly with GPU instances, allowing teams to scale inference workloads dynamically and cost-effectively. This enables enterprises to bring cutting-edge AI capabilities to production faster, while maintaining flexibility, performance, and cloud-native operational efficiency.
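Because NIM containers expose an OpenAI-compatible API, calling a deployed model means posting a standard chat-completions payload. A minimal sketch of that payload in Python follows; the model name is a placeholder for whichever NIM container you deploy.

```python
# Sketch: the request body an OpenAI-compatible NIM endpoint expects.
# "meta/llama3-8b-instruct" is a placeholder model name, not a
# recommendation for your deployment.

def build_chat_request(model: str, user_message: str,
                       temperature: float = 0.2) -> dict:
    """Assemble a chat-completions payload for an OpenAI-compatible endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }

payload = build_chat_request("meta/llama3-8b-instruct",
                             "Summarise NIM in one sentence.")
```

Posting this payload to the container's `/v1/chat/completions` route returns the familiar OpenAI-style response, so existing client code usually works with little or no change.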
3. The Rise of AI Agents: Beyond Prompt Engineering
From Single Prompts to Autonomous Behaviors
AI agents represent the next evolution of intelligent systems — moving beyond simple prompt-response interactions to goal-oriented, autonomous behaviors. Unlike traditional LLM use, where outputs depend solely on prompts, agents are designed to reason, plan, and act using a combination of tools, memory, and multi-step logic. This shift transforms static models into dynamic systems capable of executing complex workflows and adapting to real-time context.
Why AI Agents Matter Now
What makes AI agents powerful is their ability to integrate APIs, databases, search engines, and even other AI models to achieve specific outcomes. They bring contextual awareness, self-correction, and the ability to chain together actions that align with a user’s intent — things that prompt engineering alone can’t accomplish.
Real-World Use Cases Gaining Traction
We’re seeing rapid adoption across industries: AI agents are powering customer support bots that escalate issues intelligently, internal copilots that automate repetitive tasks, and retrieval-augmented generation (RAG) systems that combine model outputs with live knowledge. In enterprise environments, they function as digital coworkers — able to draft documents, analyze data, generate reports, or coordinate across systems.
The Tooling Behind Agent-Based Systems
The rise of orchestration frameworks like LangChain, CrewAI, the OpenAI Agents SDK, and Microsoft AutoGen has accelerated this shift by providing the building blocks for agent design. These tools make it easier to manage agent state, control tool access, and align actions with business logic. As the complexity of AI applications increases, the agent paradigm becomes essential for scaling intelligence across real-world workflows. If you would like to find out more about agents, take a look at our Agentic AI Primer.

4. Strategic Design of AI Agent Systems
The Building Blocks of Modern Agents
Designing effective AI agents requires more than just calling a language model — it involves building a system that can reason, take action, and adapt over time. At the core is the reasoning loop — the iterative process where the agent reflects on its current state, selects a tool or action, evaluates the result, and decides what to do next. Tools and APIs act as the agent’s “hands,” enabling it to interact with external systems such as databases, web services, or productivity apps. Memory and long-term context give agents continuity — they allow agents to remember prior interactions, user preferences, and task history, which is essential for sustained, human-like collaboration.
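The reasoning loop described above can be sketched in a few lines of Python. Everything here is hypothetical: `stub_policy` stands in for the hosted LLM that would choose the next action, and `calculator` is a toy tool.

```python
# Minimal agent reasoning loop (all names hypothetical). stub_policy
# stands in for the LLM that would choose the next action; calculator
# is a toy tool acting as the agent's "hands".

def calculator(expression: str) -> str:
    # Toy tool; never eval untrusted input in a real system.
    return str(eval(expression))

TOOLS = {"calculator": calculator}

def stub_policy(goal, history):
    """Choose the next action; a real agent would ask the hosted model."""
    if not history:
        return ("calculator", goal)       # act: call a tool
    return ("finish", history[-1][1])     # reflect: goal satisfied, stop

def run_agent(goal: str, max_steps: int = 5) -> str:
    history = []                          # short-term memory
    for _ in range(max_steps):
        action, argument = stub_policy(goal, history)
        if action == "finish":
            return argument
        observation = TOOLS[action](argument)   # use the selected tool
        history.append((action, observation))   # remember the result
    return "max steps reached"
```

The loop's shape — choose, act, observe, repeat — is the same whether the policy is a hand-written stub or a frontier model; orchestration frameworks essentially industrialise this pattern.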
Orchestration Platforms: The New Operating Layer
To manage these components cohesively, companies are turning to orchestration platforms like LangGraph, CrewAI, Microsoft AutoGen and OpenAI Agent SDK. These frameworks provide infrastructure for defining workflows, chaining actions, managing agent state, and handling fallback or retry logic. With these platforms, teams can modularize agent behavior, plug in custom tools, and apply guardrails to ensure safe and predictable outputs.
Aligning Agents with Business Logic
One of the most strategic aspects of agent design is embedding business logic directly into the agent’s decision-making flow. This allows organizations to align AI behavior with operational goals, compliance needs, and customer expectations — turning agents from generic assistants into purpose-built digital teammates. By combining robust infrastructure, thoughtful design, and the right orchestration stack, companies can unlock agents that are not only powerful, but also deeply aligned with their mission and workflows.
5. Integrating Hosted Models with Agent Workflows
Choosing the Right Interface for Model Access
When connecting agents to hosted models, the interface you choose determines how smoothly the system operates. Most modern frameworks support OpenAI-compatible APIs, making it easy to plug in self-hosted models that mimic the familiar API spec. Alternatives like REST or gRPC offer more control, especially for high-performance or low-latency use cases, depending on your infrastructure setup and programming environment.
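One practical consequence of OpenAI-compatible APIs is that switching between a hosted provider and a self-hosted model is largely a configuration change. The sketch below is illustrative only; the internal hostname is a made-up placeholder.

```python
# Illustrative only: the same request shape works against a hosted
# provider or an OpenAI-compatible self-hosted server; only the base
# URL and credentials differ. "models.internal" is a placeholder.

ENDPOINTS = {
    "openai": "https://api.openai.com/v1",
    "self_hosted": "http://models.internal:8000/v1",  # hypothetical host
}

def chat_url(provider: str) -> str:
    """Resolve the chat-completions route for a configured provider."""
    return ENDPOINTS[provider].rstrip("/") + "/chat/completions"
```

Because the route and payload are identical across providers, an agent framework pointed at `chat_url("self_hosted")` needs no code changes beyond swapping the base URL and API key.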
Balancing Security, Performance, and Scalability
Security and access control become critical when exposing hosted models to agent workflows — especially in enterprise or production environments. Consider implementing API gateways, token-based authentication, and rate limiting to safeguard model endpoints. Performance tuning is equally important; use autoscaling, GPU utilization monitoring, and request batching to maintain responsiveness during peak load. Cloud-native orchestration (e.g., on Kubernetes) makes it easier to scale inference services horizontally as agent demand increases.
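As an illustration of rate limiting, here is a toy sliding-window limiter of the kind an API gateway might apply in front of a model endpoint. A production gateway would handle this for you; this sketch only shows the idea.

```python
import time
from collections import deque

# Toy sliding-window rate limiter (illustrative only). A real gateway
# would also handle authentication, quotas, and distributed state.

class RateLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.calls = deque()  # timestamps of accepted requests

    def allow(self, now=None) -> bool:
        """Return True if a request may proceed at time `now`."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_requests:
            self.calls.append(now)
            return True
        return False

limiter = RateLimiter(max_requests=2, window_seconds=60)
```

Requests beyond the limit within the window are rejected until earlier timestamps age out, which is exactly the behaviour you want in front of an expensive GPU-backed endpoint.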
The Model as the Agent’s Brain
In an agent-based architecture, the hosted model serves as the core reasoning engine — the “brain” that interprets context, evaluates decisions, and generates intelligent responses. All agent behavior — from tool selection to action justification — flows through this core. By hosting your own model, you gain control over everything from temperature settings to system prompts, enabling you to fine-tune how the agent thinks and reacts. This tight integration between model and agent unlocks deeper personalization, smarter decisions, and a system that truly aligns with your organizational DNA.
6. How Agentic Multimodal Systems Work: A Pipeline Walkthrough
This workflow showcases a fully agentic, multimodal pipeline that transforms human voice into intelligent, responsive, and animated digital interactions. It begins with NVIDIA Riva ASR, converting speech to text, and passes it to a LangGraph-managed agentic system using LLMs and VLMs. Agents collaborate to make decisions and retrieve contextual knowledge via LangChain and Cohere. The processed output is then turned into speech by Riva TTS and animated through NVIDIA Omniverse Audio2Face. The agentic architecture ensures modularity, adaptability, and clear responsibility separation across agents. It supports function calling, multimodal understanding, and dynamic tool integration. Models like GPT-4o and Leonardo enable sophisticated visual reasoning and generation. All models are hosted through a mix of on-prem/cloud GPU servers and external APIs. This setup is ideal for real-time avatars, assistants, and embodied AI systems. The entire pipeline reflects the power of orchestration through agentic design. It includes the following important design elements:
- Modular Intelligence: Each agent in LangGraph handles distinct tasks.
- Multimodal Reasoning: Combines text, audio, and image understanding using LLMs and VLMs.
- Real-Time Avatars: Voice and facial animation are synthesized end-to-end using Riva and Omniverse.

High Level workflow for multi-modal agentic pipeline
Background of the Problem Statement
In teaching oral communication, such as PSLE preparation, teachers face constraints like high student-to-teacher ratios and limited time for personalized feedback. Students, for their part, often struggle to structure responses and communicate effectively, with little opportunity for customized feedback. This example aims to provide immediate, personalized feedback to address these issues.

Imaginative Scenario for PSLE Oral Exam Preparation
Agentic Workflow Overview
1. User Voice Input — A human speaks to initiate interaction.
2. ASR (Automatic Speech Recognition) — NVIDIA Riva ASR transcribes speech to text.
3. Agentic Workflow — This is the brain of the system. It includes:
- LangGraph — A Python framework for creating multi-agent workflows. It manages LLM orchestration in an agentic structure where each agent has a role and its own tools, enabling state-based transitions between LLM agents, memory persistence across steps, tool calling (functions, APIs), and control flow (e.g., branching, retries).
- LangChain + Cohere (Function Calling & RAG) — Middleware that provides tools such as document loaders and enables RAG.
- Leonardo (Text to Image) — Accepts natural language input and generates images to support simulated question-and-answer practice for students.
- GPT-4o (Image to Text) — Used for visual understanding (e.g., captioning an image). Once the image description is captured in the system, it is used by the application before asking the students questions.
4. TTS (Text to Speech) — NVIDIA Riva TTS synthesizes the LLM-generated response into speech.
5. Audio to Face (A2F) — NVIDIA Omniverse Audio2Face generates lip-sync and facial animation based on the input audio.
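Stripped of the real models, the five stages above chain together as plain function composition. Each function below is a stub standing in for the real component named in its docstring.

```python
# The pipeline stages chained end-to-end. Every function is a stub;
# a real deployment would call the actual NVIDIA / LLM services.

def riva_asr(audio: bytes) -> str:
    """Stub for NVIDIA Riva ASR: speech -> text."""
    return "describe the picture"

def agentic_workflow(text: str) -> str:
    """Stub for the LangGraph-managed agents: reasoning, RAG, tool use."""
    return f"Feedback on: {text}"

def riva_tts(text: str) -> bytes:
    """Stub for NVIDIA Riva TTS: text -> speech."""
    return text.encode("utf-8")

def audio2face(audio: bytes) -> str:
    """Stub for Omniverse Audio2Face: speech -> facial animation."""
    return f"animation[{len(audio)} bytes]"

def run_pipeline(audio: bytes) -> str:
    return audio2face(riva_tts(agentic_workflow(riva_asr(audio))))
```

Keeping each stage behind its own function boundary is what makes the pipeline modular: any stage can be swapped (a different ASR engine, a different avatar renderer) without touching the others.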
Model Hosting Considerations

Hosting options for various components in the multi-modal agentic pipeline
7. Modular Agentic AI Workflow with RAG, Web Search, and Guardrails

Example of modular agentic design workflow
This workflow represents a modular AI system designed with an agentic architecture, allowing specialized agents to handle distinct user intents. Inputs pass through Input Rails, which apply safety and validation checks before routing the request to the appropriate agent via an Input Router.
Depending on the intent, the input may be handled by domain-specific agents (e.g., Customer Service Agent or Oral Coach Agent). The Customer Service Agent then decides whether to go through Web Search or a Retrieval-Augmented Generation (RAG) pipeline. If Retrieval or Web Search fails to provide relevant content, the system performs an LLM fallback to generate a response. All outputs pass through Output Rails, enforcing content quality, compliance, and safety rules. This layered setup ensures the system is resilient, context-aware, and modular. It also supports easy scaling by plugging in new agents as needed. The combination of RAG, web search, and fallback mechanisms ensures robust, informative answers.
Agentic Design
The agentic design enables routing to task-specific agents for accurate handling of diverse queries. Here are the key components:
Input Router acts as the central intelligence, deciding which agent to use based on user intent. For example:
- “I want to practice my English speaking” → routes to Oral Coach Agent
- “How are you?” → routes to Customer Service Agent
Specialized agents are isolated modules with carefully engineered prompts to handle specific tasks. Each has its own generation logic, knowledge base, post-processing, and output formatting.
[Others…] indicates modular extensibility: more agents can be added as needed.
Agent separation aligns with the agentic principle of single-responsibility, enabling better performance, easier debugging, and scaling.
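A minimal sketch of the Input Router idea follows, with keyword matching standing in for the LLM-based intent classification a production router would use. Agent names and logic are illustrative.

```python
# Hypothetical Input Router: keyword matching stands in for LLM-based
# intent classification. Agents are stubbed as simple functions.

AGENTS = {
    "oral_coach": lambda q: f"[Oral Coach Agent] Let's practise: {q}",
    "customer_service": lambda q: f"[Customer Service Agent] Handling: {q}",
}

ORAL_KEYWORDS = ("practice", "practise", "speaking", "oral")

def route(query: str) -> str:
    """Pick an agent name from the user's intent (toy heuristic)."""
    text = query.lower()
    if any(word in text for word in ORAL_KEYWORDS):
        return "oral_coach"
    return "customer_service"  # default agent for everything else

def handle(query: str) -> str:
    return AGENTS[route(query)](query)
```

Because each agent sits behind the same callable interface, adding a new agent (the "[Others…]" box in the diagram) means registering one more entry in the routing table, not rewriting the router.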
Retrieval-Augmented Generation (RAG) and Web Search
It ensures responses are grounded in relevant internal or external data, as illustrated by the following flow:
1. After Input Analysis, the system decides whether to use internal (RAG) or external (Web Search) data sources to retrieve relevant information (e.g., documents, knowledge base).
2. If the retrieved content (from either the internal or external source) is relevant to the user input, the system continues with LLM-based generation.
- The Generate step (top path) combines the user query and retrieved content to produce a grounded answer.
- This answer passes through Output Rails and is returned as an answer with RAG or web search.
3. If retrieval fails (not relevant), the system proceeds to LLM Fallback to generate a response without an augmented data source.
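The retrieve, generate, fallback control flow can be sketched as below. The retriever and generator are stubs; a real system would query a vector store or web-search API and call a hosted LLM.

```python
# Sketch of the retrieve -> generate -> fallback flow. Stubs stand in
# for the vector store / web-search call and the hosted LLM.

def retrieve(query, corpus):
    """Return the first document whose key appears in the query, else None."""
    for key, doc in corpus.items():
        if key in query.lower():
            return doc
    return None  # nothing relevant found

def generate(query, context):
    """Grounded generation when context exists, LLM fallback otherwise."""
    if context is not None:
        return f"Grounded answer using: {context}"
    return f"Fallback answer for: {query}"  # no augmented data source

def answer(query, corpus):
    return generate(query, retrieve(query, corpus))
```

The key design point is that the fallback lives in the generation step, so a retrieval miss degrades the answer's grounding rather than failing the request outright.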
Guardrails (Input & Output Rails)
Guardrails enforce safety and policy compliance across all inputs and outputs, whether they come from RAG, Web Search, or a direct agent response.
Input Rails:
- Filter or validate user input before any processing.
- May include profanity filtering, prompt injection defense, content moderation, and intent classification.
Output Rails:
- Check the generated output before showing it to users.
- Can enforce content policies, toxicity checks, and fact-checking.
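A toy illustration of input and output rails follows, with keyword checks standing in for the moderation models and policy engines a real guardrails layer (such as NeMo Guardrails) would use.

```python
# Toy input/output rails. Keyword checks are stand-ins for real
# moderation models; block lists below are purely illustrative.

BLOCKED_INPUT = ("ignore previous instructions",)  # naive injection check
BLOCKED_OUTPUT = ("password", "ssn")               # naive leakage check

def input_rail(text: str) -> str:
    """Validate user input before any agent processing."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in BLOCKED_INPUT):
        raise ValueError("input rejected by input rail")
    return text

def output_rail(text: str) -> str:
    """Check generated output before it reaches the user."""
    lowered = text.lower()
    if any(term in lowered for term in BLOCKED_OUTPUT):
        return "[response withheld by output rail]"
    return text

def guarded(pipeline, user_text: str) -> str:
    """Wrap any agent/pipeline callable between the two rails."""
    return output_rail(pipeline(input_rail(user_text)))
```

Wrapping the pipeline rather than editing each agent is what makes the rails a separate layer: every agent, RAG path, and fallback automatically inherits the same checks.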
Summary

Summary of different components in modular agentic AI workflow
8. Challenges and the Path Forward
This video showcases a prototype of the Agentic Multimodal Interaction Pipeline, developed during the GovTech Launch Hackathon 2024. Due to the time constraints of the two-week sprint, we delivered a functional proof of concept; however, several key challenges remain. These limitations highlight the opportunities for future development and refinement.
Limited Realism in Avatar Appearance
The current avatar design is relatively basic and lacks the realism needed for an engaging user experience. Improvements in facial features, texture quality, and visual expressiveness are necessary to better reflect human-like interactions. Future work will explore more customizable and lifelike avatar renderings to enhance immersion.
Lacking Representation of Body Language
The prototype has no body language, which affects the avatar’s ability to convey emotions and intentions naturally. More dynamic and context-aware gestures are needed to align with spoken content and user expectations. Integrating motion capture or AI-driven animation can significantly improve non-verbal communication.
Performance Bottlenecks in the System Pipeline
The current system prototype experiences high latency and instability during interaction. Optimizing the pipeline for faster response times and scalability will be crucial for real-world deployment. Efforts will also focus on improving backend efficiency and reducing computational overhead.
To address these challenges, we’ll be sharing a follow-up blog post that dives into the specific strategies and technical approaches we’re taking. Part 2 will cover how we’re enhancing avatar realism, adding body language, and optimizing system performance. Stay tuned for a deeper look into the improvements and what’s next for the Agentic Multimodal Interaction Pipeline.
9. Appreciation
We extend our heartfelt thanks to Simon Raj Panirsilvam and Melvin Chin from Alexandra Primary School for presenting this meaningful problem statement. Their initiative and collaboration with GovTech ignited an inspiring journey into the application of multimodal and agentic AI in educational technology. This vision laid the foundation for exploring innovative ways to enhance learning experiences by bridging the gap between classroom needs and cutting-edge technology. We deeply appreciate their partnership and invaluable contributions.