How NOT to Build a GPT-5 Style Router
Background
The concept of optimizing LLM usage gained significant traction with ideas like FrugalGPT, which introduced model cascading: a technique that calls a series of models, from cheapest to most capable, and stops as soon as an acceptable answer is produced, to save costs. The next evolution of this idea was the LLM router, which aims to select the most promising model for a task in one shot. The recent launch of GPT-5, with its sophisticated built-in router, has demonstrated the value of this approach at scale.
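To make the cascading idea concrete, here is a minimal sketch of the pattern, assuming a hypothetical quality gate and placeholder model names rather than any specific vendor API:

```python
# Minimal cascading sketch: try cheaper models first and escalate only when the
# response fails a quality gate. The model names, call_model, and good_enough
# are hypothetical placeholders for your own client and scoring logic.
CASCADE = ["cheap-mini-model", "mid-tier-model", "frontier-model"]

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real LLM API call.
    return f"[{model}] answer to: {prompt}"

def good_enough(response: str) -> bool:
    # Stand-in for a scorer or verifier that decides whether to escalate.
    return len(response) > 20

def cascade(prompt: str) -> str:
    for model in CASCADE[:-1]:
        response = call_model(model, prompt)
        if good_enough(response):
            return response                      # stop early and save cost
    return call_model(CASCADE[-1], prompt)       # fall back to the most capable model

print(cascade("How do I reverse a list in Python?"))
```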
The power of intelligent routing isn’t a new concept for many enterprises, which have long realized that a generic solution is insufficient to unlock maximum performance for their specific needs. The launch of GPT-5 and its mixed feedback, however, has brought mainstream attention to routing and shifted the broader conversation from “if” to “how”.
The Problem with One-Size-Fits-All Routers
The core challenge with a generic, one-size-fits-all router is its tendency to under-segment traffic. A common failure mode stems from how generic routers use pairwise preference data. Suppose that, averaged over a certain slice of traffic, 60% of users prefer Model A over Model B. If the router treats that statistic as globally decisive, it will route most related traffic to A and miss segments and contexts where B reliably outperforms A. In practice, the goal is to discover and exploit those niche strengths (per task type, prompt pattern, end-user preference, etc.), not to crown a single global winner.
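A tiny worked example, with entirely made-up counts, shows how the global statistic hides segment-level strengths:

```python
# Hypothetical pairwise-preference counts per traffic segment. Globally, Model A
# is preferred 60% of the time, yet Model B clearly wins one segment; a router
# that only looks at the global number routes that segment incorrectly.
prefs = {                        # segment -> (A wins, B wins); made-up counts
    "code_generation": (700, 300),
    "sql_queries":     (100, 300),   # B reliably beats A here
    "summarization":   (400, 200),
}

a_total = sum(a for a, _ in prefs.values())
b_total = sum(b for _, b in prefs.values())
print(f"global preference for A: {a_total / (a_total + b_total):.0%}")  # 60%

for segment, (a, b) in prefs.items():
    print(f"{segment}: route to {'Model A' if a > b else 'Model B'}")
```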
Consequently, many enterprises explore building their own router to get a system that can handle the specific details of their prompts, users, and business rules.
The Pitfalls of the Traditional ML Path
However, this path leads them to adopt a traditional machine learning workflow: gathering and curating vast datasets, training and validating a classification model, and deploying it into production for A/B testing. This entire costly cycle must then be repeated as soon as performance degrades due to data drift or the release of a new frontier model, which nowadays happens on a weekly basis.
Below, we’ll discuss a more effective and adaptive approach.
Human Intuition: The Ultimate Routing Engine
Consider how you personally use different LLMs for various coding tasks. You’ve likely developed an intuition for which model to use for which scenario.
For a complex task involving multiple libraries and dependencies, you need a model that can follow instructions precisely with minimal API hallucinations.
For brainstorming the initial design for a new system, you probably turn to a more knowledgeable model that excels at multi-step reasoning.
For a quick lookup, like how to calculate a mean in PyTorch, any small, fast variant of a frontier model would likely suffice.
You aren’t picking models at random; you are exploiting a sophisticated, personal routing policy refined by experience.
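Written down, that intuition looks like a small lookup table. The task labels and model names below are hypothetical; the whole point of a learning router is to infer such a policy from feedback instead of hand-maintaining it:

```python
# A hand-written routing policy mirroring the intuition above. Everything here
# is illustrative; a learned router would discover this mapping on its own.
PERSONAL_POLICY = {
    "multi_library_integration": "precise-instruction-follower",
    "system_design_brainstorm":  "strong-reasoning-model",
    "quick_api_lookup":          "small-fast-model",
}

def route(task_type: str) -> str:
    return PERSONAL_POLICY.get(task_type, "general-purpose-model")

print(route("quick_api_lookup"))   # -> small-fast-model
```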
Learning from Experience: A Reinforcement Learning Framework
This intuitive process is essentially a form of reinforcement learning (RL). It breaks down into a simple loop:
- State: The user’s prompt and its surrounding context define the current state.
- Action: An LLM is selected to handle the prompt. This is the action.
- Reward: The user evaluates the result. Was it helpful? Was it accurate? This assessment is the reward signal.
- Policy: Over time, the system learns which actions (models) yield the highest rewards for different states (prompts and contexts). This learned strategy becomes its policy.
The system explores, gets feedback, and adapts its policy to improve future outcomes. Why shouldn’t our enterprise systems be built to do the same?
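As a minimal sketch, assuming a coarse prompt category as the state and user feedback as the reward, the loop can be as simple as an epsilon-greedy bandit (the model names and feedback source are placeholders):

```python
import random
from collections import defaultdict

MODELS = ["model-a", "model-b", "model-c"]
EPSILON = 0.1                                    # exploration rate

value = defaultdict(float)                       # (state, model) -> mean reward
count = defaultdict(int)                         # (state, model) -> observations

def select_model(state: str) -> str:             # action
    if random.random() < EPSILON:
        return random.choice(MODELS)             # explore
    return max(MODELS, key=lambda m: value[(state, m)])   # exploit current policy

def record_feedback(state: str, model: str, reward: float) -> None:
    key = (state, model)
    count[key] += 1
    value[key] += (reward - value[key]) / count[key]       # incremental mean update

# One interaction of the loop: observe the state, act, then learn from the reward.
state = "code_review_prompt"
model = select_model(state)
record_feedback(state, model, reward=1.0)        # e.g. the user gave a thumbs-up
```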
The Anatomy of a Learning Router
If we want to build a router based on this principle, the “state” (the user’s prompt and context) and the “action” (picking an LLM) are straightforward. The most challenging part is defining the “reward”.
Feedback Loop
This is where an effective router must go beyond its own code: it needs to integrate with a customer’s existing observability and analytics ecosystem. The reward signal can’t be invented or judged by another LLM; it must be derived from real feedback loops. Did the user give a thumbs-up? Did they retry their query? Did the task complete successfully? A router that truly works needs to connect to these feedback mechanisms to understand what success actually looks like. Any pre-built reward model, whether rule-based or learned from a static, benchmark-like dataset, will ultimately prove inadequate because it is disconnected from the actual, real-time feedback of your users.
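As a sketch of what that integration might produce, the snippet below folds raw feedback events into a scalar reward; the event names and weights are hypothetical and would be tuned against your own product metrics rather than fixed up front:

```python
# Turning real observability events into a reward signal. The events and weights
# are illustrative assumptions, not a prescribed schema.
REWARD_RULES = {
    "thumbs_up":      +1.0,
    "thumbs_down":    -1.0,
    "user_retry":     -0.5,    # the user immediately re-asked the same question
    "task_completed": +0.5,    # e.g. generated code passed CI
}

def reward_from_events(events: list[str]) -> float:
    return sum(REWARD_RULES.get(event, 0.0) for event in events)

print(reward_from_events(["thumbs_up", "task_completed"]))   # 1.5
print(reward_from_events(["thumbs_down", "user_retry"]))     # -1.5
```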
Long-Horizon Planning for Agentic Tasks
Reinforcement learning distinguishes itself from other online learning approaches in how it copes with messy real-world feedback, which is often delayed (a user might give a thumbs-up minutes later) or sparse (many users provide no feedback at all). Furthermore, many valuable tasks are multi-turn. Consider an autonomous coding agent in a CLI environment. The initial, critical decision to use an advanced reasoning model to create an execution plan is only validated after the agent successfully executes a series of subtasks and observes the results. RL is particularly well suited to learning the connection between an early strategic choice and a distant outcome.
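One simple way to picture this credit assignment, assuming an illustrative discount factor and episode structure, is to propagate the episode’s final reward back to each routing decision:

```python
# Delayed-reward credit assignment: the outcome only arrives after the agent
# finishes its subtasks, and earlier routing decisions receive slightly
# discounted credit. GAMMA and the episode below are illustrative assumptions.
GAMMA = 0.95

def discounted_credit(decisions: list[str], final_reward: float) -> dict[str, float]:
    credit = {}
    for steps_before_end, decision in enumerate(reversed(decisions)):
        credit[decision] = final_reward * (GAMMA ** steps_before_end)
    return credit

# The plan was made with a reasoning model; subtasks then ran on cheaper models.
episode = ["plan:reasoning-model", "edit:small-model", "test:small-model"]
print(discounted_credit(episode, final_reward=1.0))
# The early "plan" decision still receives credit despite the distant reward.
```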
A useful metaphor is to compare standalone embedding models with the embedding layer inside an LLM. The former is trained for document similarity, a relatively simple retrieval task. The latter is trained for something far more complex: predicting the next token to generate a coherent response. Similarly, a router’s goal isn’t just to find a model that is “similar” to what worked in the past; it must optimize for the entire sequence of events that leads to a successful outcome within the customer’s unique context.
The Latency Question
An interesting consideration is whether latency should be a direct factor in the routing decision. We argue that it shouldn’t. In production, latency isn’t a linear “faster is always better” metric. Instead, it’s measured on a spectrum from “unacceptably slow” to “fast enough.” An extreme latency issue, like a model “thinking” for five minutes to write a one-line Java snippet, will naturally surface in the user feedback through low engagement. At the other extreme, for a monumental task like creating a billion-dollar business, the difference between a three-day and a thirty-day turnaround is negligible.
Latency is an integral part of the overall user experience, and its impact is best captured through the same reward signals we use for quality and user experience.
Online vs. Offline Learning
With a mechanism for capturing feedback established, the next critical choice is the learning methodology. An offline approach, which involves collecting data over a period and retraining the model periodically, presents significant challenges for the dynamic nature of LLM interactions. The LLM landscape changes weekly. A new, more capable model might be released, an existing model could silently degrade in quality on specific tasks, or pricing structures can shift overnight. Even internal business requirements can change, such as switching from maximizing performance to optimizing for cost once a quality threshold is met. By the time an offline model is retrained on historical data, it’s already making decisions based on an outdated reality.
An online, real-time system offers a far more effective paradigm. It learns from every single interaction as it happens, allowing it to adapt immediately to these changes. It can start exploring a new model as soon as it’s available and react quickly if a preferred model’s performance drops, ensuring its routing strategy is perpetually optimized for the present moment.
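A toy illustration of that responsiveness, with an assumed step size and synthetic rewards, is an exponentially weighted estimate that forgets old observations:

```python
# Online adaptation to drift: an exponentially weighted average discounts stale
# observations, so a model whose quality silently drops loses its lead within a
# handful of interactions instead of waiting for the next offline retrain.
STEP = 0.1                                  # higher step size = faster adaptation

scores = {"model-a": 0.0, "model-b": 0.0}   # running reward estimates

def update(model: str, reward: float) -> None:
    scores[model] += STEP * (reward - scores[model])

for _ in range(50):                         # model-a starts as the clear favorite
    update("model-a", 1.0)
    update("model-b", 0.6)

for _ in range(20):                         # then model-a silently degrades
    update("model-a", 0.2)
    update("model-b", 0.6)

print(max(scores, key=scores.get))          # the router now prefers model-b
```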
Design Principles for an Adaptive Router
Considering these points, a successful online RL router must be guided by several core principles: it should be intelligent, private, flexible, and economically practical.
Privacy: An adaptive router must be deployed in a “Bring Your Own Compute” (BYOC) model, meaning the learning algorithms run entirely inside the customer’s infrastructure, keeping all prompts and feedback signals secure.
Dynamic Action Space: The set of available LLMs, or the “action space” in RL terms, is constantly changing. The router needs the ability to update its set of available models in real time, adding new options and removing deprecated ones without interrupting service or retraining (see the sketch after these principles).
Learning Efficiency: An effective router should be compute-friendly, capable of running on a standard multi-core CPU instance. This is a crucial distinction, as many powerful RL algorithms require immense computing power that is impractical for this application. Running a large-scale PPO algorithm for routing, for example, would introduce costs that defeat the purpose of optimization.
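As a minimal sketch of how the last two principles can coexist, the router below uses Beta-Bernoulli Thompson sampling: it runs comfortably on a CPU (no policy network, no PPO), and its action space is just a dictionary, so models can be added or removed at runtime. The model names and priors are hypothetical, and a production router would additionally condition on the prompt context rather than treating all traffic alike.

```python
import random

class ThompsonRouter:
    """Lightweight, CPU-only routing sketch with a dynamic action space."""

    def __init__(self) -> None:
        self.arms = {}                                # model -> [alpha, beta] counts

    def add_model(self, name: str) -> None:
        self.arms.setdefault(name, [1, 1])            # uninformative Beta(1, 1) prior

    def remove_model(self, name: str) -> None:
        self.arms.pop(name, None)

    def select(self) -> str:                          # O(number of models), no GPU
        return max(self.arms, key=lambda m: random.betavariate(*self.arms[m]))

    def update(self, name: str, success: bool) -> None:
        if name in self.arms:
            self.arms[name][0 if success else 1] += 1

router = ThompsonRouter()
for model in ("model-a", "model-b"):
    router.add_model(model)

choice = router.select()
router.update(choice, success=True)                   # e.g. positive user feedback
router.add_model("brand-new-frontier-model")          # new option, no retraining needed
router.remove_model("model-a")                        # deprecated model dropped live
```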
The Future is Adaptive
The traditional ML workflow is simply too slow for the fast-paced reality of the AI landscape. The future belongs to systems that learn and adapt in real time. An online, learning-based approach, built on the principles of privacy, flexibility, and efficiency, allows a router to move beyond optimizing for the majority and instead learn the specific context of your users to find the best model for each task.
We built the Arfniia Router around these ideas. If you’re a developer who is curious about this adaptive approach, we’ve made it available on the AWS Marketplace with a 15-day free trial. We’re genuinely interested to see what the community builds with it and would love to hear your feedback. Feel free to reach out directly at shawn@arfniia.com with any questions or thoughts.