Abstract

Latency is the central systems problem in multimodal agents. Once an application combines image encoding, text reasoning, retrieval, function calling, and downstream APIs, user-perceived delay becomes the sum of several heterogeneous stages rather than the cost of a single model call. Recent serving work shows that multimodal requests suffer from stage imbalance, head-of-line blocking, and poor resource utilization when vision encoding, language prefill, and decode are scheduled as one monolithic unit. Recent agentic inference work further shows that tool calls often dominate first-token latency and that decoupled orchestration layers destroy cache locality across iterative calls.

This paper synthesizes current research and vendor guidance into a practical design for latency-aware multimodal agent orchestration. The central argument is that low-latency multimodal agents require four decisions to be made jointly rather than independently: stage-aware multimodal scheduling, dynamic model routing, prefix-aware cache placement, and budgeted tool calling. The literature now contains strong partial solutions for each layer, but there is still no widely adopted unified stack that optimizes all four at once. The proposed architecture in this paper is intended as a production design pattern rather than a benchmark claim from this repository.

Keywords: multimodal agents, latency optimization, heterogeneous inference, dynamic routing, prompt caching, tool orchestration

1. Problem Statement

The current generation of multimodal agents is expected to interpret images, read context, reason over text, decide whether external tools are necessary, call those tools, and then synthesize a final answer. In practice, this pipeline is frequently too slow for interactive use.

The difficulty is structural:

multimodal requests contain different computational stages with very different resource profiles
model routing choices affect both quality and queueing delay
cache hits depend on prompt stability and server locality rather than model quality alone
tool calls add network delay, serialization overhead, and orchestration stalls

For a user, only end-to-end delay matters. For the system designer, however, that delay is the result of a chain:

T_total = T_encode + T_prefill + T_decode + T_route + T_tool + T_retrieval + T_queue + T_postprocess

A latency-aware orchestration layer therefore cannot optimize only one term. It must decide when to use a faster model, when to reuse a cache, when to skip or defer a tool, and when to separate multimodal stages onto different resources.

2. Why Existing Multimodal Agents Feel Slow

Recent research isolates several recurring causes of delay.

Source of delay	Why it happens	Practical consequence
Vision-language stage coupling	Image encode, prefill, and decode have different memory and compute needs, yet many systems schedule them together	Low utilization and inflated request latency
Head-of-line blocking	Large image or video requests monopolize shared queues	Short text or small-image requests wait behind heavy jobs
Router blindness	Static model selection ignores instantaneous queue state, cache locality, or latency budget	A high-quality choice can still be the wrong runtime choice
Cache fragmentation	Repeated prompts land on different workers or have unstable prefixes	KV reuse collapses even when semantic reuse exists
Tool overuse	Agents call tools for information that could be answered from local context or retrieved memory	External latency dominates response time
Sequential orchestration	Tool execution and later model work are handled strictly step by step	Parallelism opportunities are lost

HydraInfer argues that multimodal inference should not be treated like text-only serving with an image encoder attached; its encode, prefill, and decode stages should be scheduled separately across heterogeneous instances when service objectives demand it. RPS-Serve makes the complementary point that modality itself changes queue behavior: large multimodal jobs create blocking patterns that resemble oversized packets in a shared network, so the scheduler has to become modality-aware rather than merely throughput-aware. DistServe, although focused on text LLM serving, shows why separating prefill from decode can improve goodput under latency constraints; that insight transfers directly to multimodal agent backbones.

Latency pipeline decomposition

This article draws on three evidence streams: model-serving systems, routing systems, and agentic tool-execution systems.

Area	Representative source	Key observation used here
Multimodal serving	HydraInfer (2025)	Stage-level disaggregation reduces latency and improves throughput under multimodal workloads
Modality-aware scheduling	RPS-Serve (2026)	Small interactive requests should not be trapped behind heavy visual requests
Prefill/decode disaggregation	DistServe (2024)	Separating stages can substantially improve SLO attainment
Routing across models	MixLLM (2025), FrugalGPT (2023), RouteLLM (2024)	Query-level routing can trade off quality, cost, and latency rather than hardcoding one model
Cache reuse	vLLM Automatic Prefix Caching, OpenAI Prompt Caching	Prefix reuse is real, but only when prompts are stable and routed to compatible execution state
Tool-intercept serving	InferCept (2024)	Tool-augmented inference wastes time if each intercept forces redundant recomputation
Agentic orchestration co-design	Sutradhara (2026)	Tool execution, prompt splitting, and cache management should be integrated with the serving engine

Two findings are especially important for this topic.

First, the serving literature now clearly shows that stage separation matters. HydraInfer reports up to 4x higher multimodal throughput than vLLM under its evaluated workloads while meeting a 90th percentile request SLO. DistServe reports that prefill/decode disaggregation can serve substantially more requests under latency constraints. These are not generic optimization tips; they imply that a multimodal agent router should understand which stage is becoming the bottleneck before it chooses a model or dispatches a tool.

Second, the agentic inference literature shows that tool use is itself a latency system. InferCept reports that external interactions in augmented LLM inference can trigger redundant recomputation. Sutradhara reports that tool calls can consume 30-80% of first-token-rendered latency and that sequential orchestration wastes intra-request parallelism. This means that better tool use is not only a reasoning problem; it is a scheduling and cache-management problem.

4. Design Requirements for a Better Orchestrator

Any serious orchestration system for this workload should satisfy five requirements.

4.1 Stage Awareness

The orchestrator must distinguish at least these stages:

multimodal encode
language prefill
iterative decode
retrieval and memory lookup
tool execution and reintegration

Collapsing these stages into one routing decision hides the true bottleneck.

4.2 Budget Awareness

The orchestrator should maintain explicit budgets for:

wall-clock latency
model cost
tool-call count
external API timeout
evidence confidence threshold

Without budgets, tool use tends to expand until it dominates the interaction.

4.3 Cache Awareness

The system should preserve identical or near-identical prompt prefixes, tool schemas, and reusable system instructions. OpenAI’s prompt caching guidance explicitly notes that exact prefix matches, including identical images and tools, are required for cache hits. vLLM’s automatic prefix caching makes the same point at the engine layer. Therefore, the orchestration layer has to be written in a cache-friendly way, not merely deployed on a cache-capable backend.

4.4 Heterogeneous Routing

Different requests should not be forced through one general-purpose multimodal model. Simple visual classification, OCR-heavy extraction, policy lookup, and long-form synthesis rarely require the same routing policy.

4.5 Graceful Degradation

When the latency budget is nearly exhausted, the system should degrade predictably:

reduce candidate models
stop speculative tool exploration
shrink output length
fall back to text-only reasoning if visual evidence is already encoded
return a partial answer with explicit uncertainty if required

5. Proposed Architecture

The proposed architecture is a layered orchestrator that sits above the serving plane but exchanges runtime hints with it rather than treating the engine as a black box.

Latency-aware multimodal orchestration architecture

5.1 Request Ingress

The request is normalized into a structured envelope:

task type
modalities present
estimated complexity
user latency tier
maximum tool budget
cache key candidates

This is the only point where request classification should be expensive. Everything downstream should use compact metadata.

5.2 Stage Splitter

If the request includes images or video, the orchestrator separates visual encoding from text planning. This follows the same intuition as HydraInfer: multimodal encode does not need to be pinned to the same schedule as decode.

5.3 Router

The router selects among candidate models using a multi-objective score:

expected answer quality
expected queueing delay
expected cache hit probability
expected tool usage downstream
marginal cost

This is where the literature on FrugalGPT, RouteLLM, and MixLLM becomes useful. Their core lesson is that routing should be contextual and continuously updated rather than static.

5.4 Cache Broker

The cache broker maintains stable prompt prefixes, assigns prompt_cache_key values where the backend supports them, and prefers workers with recent matching prefixes when possible. The point is not just to enable caching but to avoid routing decisions that accidentally destroy cache locality.

5.5 Tool Budget Controller

The controller decides among four options:

no tool call
cached tool result
single direct tool call
multi-tool plan with strict stopping criteria

Each action must consume from an explicit latency budget rather than an implicit hope that the tool will be fast enough.

6. Dynamic Routing Strategy

Dynamic routing in this setting should be treated as a constrained decision problem, not a generic classifier.

Signal	Why it matters
Modality mix	Image and video requests produce different stage costs than text-only requests
Queue state	The best model in isolation may be the worst model under congestion
Cache locality	A slightly weaker model with a hot prefix can beat a stronger cold model on end-to-end latency
Tool likelihood	Requests likely to trigger retrieval or APIs should reserve budget before model selection
Output length estimate	Decode-heavy tasks need different choices than short extraction tasks

MixLLM is especially relevant because it frames routing as a dynamic tradeoff among quality, cost, and latency, and reports strong tradeoff performance under time constraints. That result should not be copied mechanically to multimodal agents, but it supports the architectural choice to make routing policy context-sensitive and continually adaptable.

7. Cache Strategy

Cache strategy has to be designed at three levels.

7.1 Prefix Stability

Place the most stable instructions first, variable user content later, and keep tool definitions identical whenever possible. This follows OpenAI’s prompt caching guidance directly and also improves compatibility with engine-level prefix caching.

7.2 Cross-Step Reuse

InferCept and Sutradhara both highlight a common failure mode in agentic systems: after a tool intercept, the next call often recomputes context that the system already had. The orchestrator should preserve reusable intermediate state across tool boundaries instead of serializing each step into a totally new request.

7.3 Worker Affinity

Even a perfect prefix is useless if requests are routed away from the worker holding the relevant cache state. For that reason, latency-aware routing should include cache affinity as a first-class objective rather than a backend implementation detail.

8. Budgeted Tool Calling

Tool calling should be governed by an explicit policy, not by unconstrained model preference.

8.1 Decision Rule

A tool should be called only if the expected utility exceeds the combined cost of:

added latency
added failure risk
serialization overhead
cache disruption
downstream synthesis time

8.2 Tool Tiers

Tier	Example	Default policy
Tier 0	local memory or cached retrieval result	use freely
Tier 1	fast deterministic API lookup	use if confidence gain is material
Tier 2	slower search, browser, or database action	require unmet evidence need
Tier 3	multi-step external workflow	require explicit budget and stopping rule

8.3 Early Exit Conditions

The orchestration loop should stop tool use when:

the marginal evidence gain is below a threshold
the latency budget is nearly consumed
two consecutive tools return redundant evidence
the answer can be completed with bounded uncertainty

This is where many agent stacks still underperform in production. They optimize correctness heuristics but lack hard stop conditions for latency control.

9. Recommended Evaluation Protocol

A serious evaluation should report more than average latency.

Metric	Why it is required
TTFT or FTR	Captures user-perceived responsiveness
End-to-end latency	Captures full completion cost
P90 and P99 latency	Tail behavior matters for interactive systems
Tool-call rate	Measures whether the agent is overusing tools
Cache hit rate	Reveals whether orchestration preserves reusable prefixes
Cost per successful task	Prevents fast but wasteful optimization
Quality under budget	Measures whether latency reductions harm task success

Workloads should be split by request class:

text only
text plus one image
OCR-heavy document image
large image set
video or long visual sequence
retrieval-rich tool-assisted task

Without workload segmentation, latency claims are difficult to trust because small text jobs can hide failures on heavy multimodal requests.

10. Practical Implementation Guidance

For a production system built today, the strongest design pattern is:

classify the request early
separate multimodal encode from text planning where the stack permits it
route using a multi-objective score that includes queue state and cache affinity
keep prompts and tool schemas prefix-stable
treat tool calls as budgeted actions
preserve intermediate state across tool intercepts
measure TTFT, end-to-end delay, tool rate, and cache hit rate together

This is a synthesis from the cited literature and vendor guidance rather than a claim that one framework currently solves the whole stack out of the box.

11. Limitations and Open Questions

Several hard problems remain open.

Most routing literature is still text-centric and does not fully internalize multimodal stage imbalance.
Cache systems are usually backend-local, while agent frameworks often make routing decisions without visibility into cache state.
Tool budgets are rarely learned from real utility curves; many systems still rely on heuristics.
Multimodal benchmarks often report throughput or average latency without enough tail-latency detail for interactive use cases.
Cross-provider routing is operationally attractive but difficult when capabilities, tokenization, and cache semantics differ.

The open research gap is therefore not how to make a model faster in isolation. It is how to jointly optimize routing, caching, and tool use when multimodal stages have heterogeneous latency behavior.

12. Conclusion

The evidence now supports a clear conclusion: multimodal agent latency is primarily an orchestration problem. Better base models help, but the decisive gains come from stage-aware scheduling, contextual routing, cache-preserving request design, and tool-use policies that respect strict latency budgets. Systems such as HydraInfer, DistServe, InferCept, and Sutradhara demonstrate that substantial latency reduction is possible when the orchestration layer and serving layer are designed together. What remains missing in most production stacks is a unified controller that makes those decisions jointly for multimodal, tool-augmented workloads.

References

Xianzhe Dong, Tongxuan Liu, Yuting Zeng, Liangyu Liu, Yang Liu, Siyu Wu, Yu Wu, Hailong Yang, Ke Zhang, and Jing Li. “HydraInfer: Hybrid Disaggregated Scheduling for Multimodal Large Language Model Serving.” arXiv, 2025. https://arxiv.org/abs/2505.12658
Konstantinos Papaioannou and Thaleia Dimitra Doudali. “Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference.” arXiv, 2026. https://arxiv.org/abs/2603.26498
Yinmin Zhong, Shengyu Hao, Weitian Wang, Yilong Lv, Yujie Wang, Dacheng Li, Xingda Wei, Zhenheng Tang, Xiaoxuan Jiang, Jiarui Fang, Yibo Zhu, Haibo Chen, and Xin Jin. “DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.” arXiv, 2024. https://arxiv.org/abs/2401.09670
Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, and Haifeng Chen. “MixLLM: Dynamic Routing in Mixed Large Language Models.” NAACL, 2025. https://aclanthology.org/2025.naacl-long.545/
Lingjiao Chen, Matei Zaharia, and James Zou. “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” arXiv, 2023. https://arxiv.org/abs/2305.05176
Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. “RouteLLM: Learning to Route LLMs with Preference Data.” arXiv, 2024. https://arxiv.org/abs/2406.18665
Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiying Zhang. “InferCept: Efficient Intercept Support for Augmented Large Language Model Inference.” ICML, 2024. https://proceedings.mlr.press/v235/abhyankar24a.html
Anish Biswas, Kanishk Goel, Jayashree Mohan, Alind Khare, Anjaly Parayil, Ramachandran Ramjee, and Chetan Bansal. “Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference.” arXiv, 2026. https://arxiv.org/abs/2601.12967
OpenAI. “Prompt caching.” OpenAI API Documentation. Accessed March 30, 2026. https://platform.openai.com/docs/guides/prompt-caching
OpenAI. “Latency optimization.” OpenAI API Documentation. Accessed March 30, 2026. https://platform.openai.com/docs/guides/latency-optimization
vLLM Project. “Automatic Prefix Caching.” vLLM Documentation. Accessed March 30, 2026. https://docs.vllm.ai/usage/automatic_prefix_caching.html

Abstract#

1. Problem Statement#

2. Why Existing Multimodal Agents Feel Slow#

3. Related Work and Evidence Base#

4. Design Requirements for a Better Orchestrator#

4.1 Stage Awareness#

4.2 Budget Awareness#

4.3 Cache Awareness#

4.4 Heterogeneous Routing#

4.5 Graceful Degradation#

5. Proposed Architecture#

5.1 Request Ingress#

5.2 Stage Splitter#

5.3 Router#

5.4 Cache Broker#

5.5 Tool Budget Controller#

6. Dynamic Routing Strategy#

7. Cache Strategy#

7.1 Prefix Stability#

7.2 Cross-Step Reuse#

7.3 Worker Affinity#

8. Budgeted Tool Calling#

8.1 Decision Rule#

8.2 Tool Tiers#

8.3 Early Exit Conditions#

9. Recommended Evaluation Protocol#

10. Practical Implementation Guidance#

11. Limitations and Open Questions#

12. Conclusion#

References#