Route your prompts to the best LLM
Hey HN, we've just finished building a dynamic router for LLMs, which takes each prompt and sends it to the most appropriate model and provider. We'd love to know what you think!

Here is a quick(ish) screen recording explaining how it works: https://youtu.be/ZpY6SIkBosE

You get the best results by training a custom router on your own prompt data: https://youtu.be/9JYqNbIEac0

The router balances user preferences for quality, speed, and cost. The end result is higher-quality and faster LLM responses at lower cost.

The quality of each candidate LLM is predicted ahead of time by a neural scoring function: a BERT-like architecture conditioned on the prompt and on a latent representation of the LLM being scored. The candidate LLMs are queried across the batch dimension, with the scoring architecture taking a single LLM latent as input per forward pass, which makes the scoring function very modular to query for different LLM combinations (see the first sketch at the end of this post). It is trained in a supervised manner on several open LLM datasets, using GPT-4 as a judge. The cost and speed data comes from our live benchmarks, which are updated every few hours across all continents. The final "loss function" is a linear combination of quality, cost, inter-token latency, and time-to-first-token, with the user effectively scaling the weighting factors of this combination (see the second sketch below).

Smaller LLMs are often good enough for simple prompts, but knowing exactly how and when they might break is difficult. Simple perturbations of the phrasing can cause smaller LLMs to fail catastrophically, making them hard to rely on. For example, Gemma-7B converts numbers to strings and returns the "largest" string when asked for the "largest" number in a set, but works fine when asked for the "highest" or "maximum" number. The router is able to learn these quirky failure modes and ensure that the smaller, cheaper, and faster LLMs are only used when there is high confidence that they will get the answer correct.

Pricing-wise, we charge the same rates as the backend providers we route to, without taking any margin. We also give $50 in free credits to all new signups.

The router can be used off-the-shelf, or it can be trained directly on your own data for improved performance.

What do people think? Could this be useful? Feedback of all kinds is welcome!
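For the curious, here is a minimal PyTorch sketch of the batched scoring idea described above. Everything in it (the plain transformer encoder standing in for the BERT-like model, the mean pooling, the names and dimensions) is an illustrative placeholder rather than our actual code:

```python
import torch
import torch.nn as nn

class LLMScorer(nn.Module):
    """Scores candidate LLMs for a prompt, one LLM latent per batch row."""

    def __init__(self, vocab_size=30522, dim=256, n_llms=16):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.llm_latents = nn.Embedding(n_llms, dim)  # one learned latent per LLM
        self.head = nn.Linear(2 * dim, 1)             # predicted quality score

    def forward(self, prompt_ids, llm_ids):
        # prompt_ids: (1, seq_len) token ids for a single prompt
        # llm_ids:    (n,) indices of the candidate LLMs to score
        h = self.encoder(self.token_emb(prompt_ids)).mean(dim=1)  # (1, dim)
        h = h.expand(llm_ids.shape[0], -1)  # tile the prompt encoding across the batch
        z = self.llm_latents(llm_ids)       # (n, dim), one latent per candidate
        return self.head(torch.cat([h, z], dim=-1)).squeeze(-1)   # (n,) scores

scorer = LLMScorer()
prompt_ids = torch.randint(0, 30522, (1, 32))         # toy tokenized prompt
scores = scorer(prompt_ids, torch.tensor([0, 3, 7]))  # score 3 candidate LLMs
```

Scoring a different combination of LLMs is then just a matter of passing different llm_ids; the prompt encoder itself doesn't change.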
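And here is a toy version of the user-weighted objective that turns those predictions into a routing decision. All metric values and weights below are invented purely for illustration:

```python
# (model: predicted quality, $ per 1M tokens, inter-token latency s, TTFT s)
candidates = {
    "gpt-4":    (0.92, 30.00, 0.030, 0.80),
    "mixtral":  (0.81,  0.60, 0.012, 0.35),
    "gemma-7b": (0.68,  0.10, 0.008, 0.25),
}

def route(w_quality, w_cost, w_itl, w_ttft):
    # Linear combination: reward quality, penalise cost and latency.
    def reward(name):
        q, cost, itl, ttft = candidates[name]
        return w_quality * q - w_cost * cost - w_itl * itl - w_ttft * ttft
    return max(candidates, key=reward)

print(route(1.0, 0.001, 1.0, 0.1))  # quality-leaning weights -> "gpt-4"
print(route(1.0, 0.1,   1.0, 0.1))  # cost-sensitive weights  -> "mixtral"
```

Turning up the cost weight flips the decision from the strongest model to a cheaper one; these weights are effectively the knobs the user sets.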
Users were intrigued by the Show HN product, though some expressed concerns about the business model, the routing approach, and the potential for AI monopolies. There was interest in the discount incentives, revenue sharing, and the tool's ability to simplify model selection, and questions were raised about data storage, terms of service, and performance. Some users preferred fixed fees or commissions for stability, while others suggested charging based on savings. The product was compared to existing tools like LangChain and LlamaIndex, and there were calls for benchmarks and performance data. Concerns about treating LLMs as interchangeable, and a preference for sticking with fixed models, were noted alongside the potential upside of unifying services.
Users also expressed concerns about unclear monetization, data-usage policies, and the potential for quality to suffer in the pursuit of cost savings. There were questions about the business model's sustainability, latency, and how models are evaluated and transitioned between. Criticisms touched on the risk of violating providers' terms of service, the absence of certain integrations, and the potential for gaming the system. Some users were skeptical about the technology's maturity and the effectiveness of routing across providers, while others noted minor issues such as typos and asked for additional features like benchmarks.