Building a Multi-Turn, Multi-Agent, Data Insight Chat System Part 5: Model Matters, Context Matters

Introduction
In my last post I talked at length on how to actually build a multi agent chat system, but there was lot I had to leave out, in this post I’m going to go into a bit more detail in the impact the model context can have and the difference choosing the right model makes.
Choosing the right model
In all the previous parts of this series I’ve shied away from mentioning an actual model, I’ve referred to LLMs in the abstract mentioned “the Model” or “the agent”. The simple reason for this is that the model which drives this system has changed as many as 10 times throughout the project, and I’ve experimented with many more. As LLM technology moves on and different models become available I expect it to change again. Here are some of the notable stopping points along the way:
GPT3.5-Turbo 16k: What as readily available at the time development started. The limited context window made it difficult to work with but suitable for prototyping. Middling performance in tool selection.
Qwen3-14B: Affordable, quick and big step forward in terms of context window size (which can be pushed up to 128k using YaRN). Concerns over bias in the model, particularly when questions about topics the Chinese state has an interest in (Taiwan, Tibet etc) bring the usefulness of the model into question. Middling performance in tool selection. Some issues handing off the conversation between agents consistently.
GPT4o: A bit more expensive, but decent performance, decent output and decent at tool calling. Decent all around.
GPT-20b-oss: Very fast, very cheap, with good output, good at tool calling, good at handing off. In terms of bang for buck nothing beat this model, it allayed any concerns that delivering the feature would be too expensive if this model was the one driving it. However finding a provider that implemented forced JSON outputs and tool calling properly was a struggle, Azure doesn’t and the only other provider found that did was unreliable in terms of lag and uptime.
GPT-120b-oss: Very similar to the above but slightly better output and a bit more expensive, but still so cheap there’s really no reason not to run this model instead of the 20b variant. Comes will all the same tool call and forced JSON output problems though.
GPT-5.1: Good performance, great output, eventually cheaper than 4o (which OpenAI and Azure will soon decommission regardless), good tool use but tends to produce much longer outputs given the same prompts as the above models. Requires a bit of prompt tweaking to get the desired results.
You’ll notice that most of these are GPT based models originating from OpenAI, that’s not to say that other models weren’t experimented with. The LLama family of models produced pretty good results too, and offered large context windows but I did find that with these models although the window was large the model often didn’t seem to take it all into account in it’s responses.
The Chinese models we tried and those I’ve used in other domains for hobbyist projects are very good, but unusable in this project for reasons of bias and politics. It’s fascinating how quickly the Chinese have caught up in this area definitely worth keep an eye on!
Model Agnosticism
A good side effect of the forced JSON responses, built in “reasoning” and prompt architecture described in my last post is that you tend to get similar results regardless of the model. Almost by accident the system is model agnostic requiring very little prompt tweaking. This is a big advantage, it allows for the agents output to easily take advantage of advances and models with new capabilities.
Testing Model Output
While talking about the ease of switching models it’s worth briefly discussing how each model is assessed. Without a huge amount of resources this is extremely difficult. In practice formal testing was by presenting each model with fairly large series of prompts and agentic tasks and comparing against hand written ideal outcomes. Informal testing is gut based, the dev & testing teams working with the model, while not as rigorous this helps preventing over fitting prompts and other tweaks to the formal tests. Finally a small group of clients beta tests changes to the feature and provides valuable feedback based on real world situations and data.
Context Matters
Context is a word that has come up a lot in this series, broadly speaking this is everything we send to the model as part of a prompt, but the specifics matter. One innovation that came fairly late in the development of the system was the idea of organisational (or customer) context.
Organisational context varies greatly from customer to customer, it is usually about a page of text which is put together by account managers in collaboration with the customer. It describes the type of organisation, it’s goals, key players, management structure, strategic pillars and all that other background that the customer staff would take for granted when making decisions and thinking about their projects and data.
Good general context like this makes a massive difference to the output of the model and how relevant it is to users. It also helps to reinforce to users that the agents understand the world they’re operating in which increases confidence in the agents responses and makes conversations flow more easily without having to constantly re-state assumptions at the start of each conversation.
Conclusion
The model and context come together to power the agents and produce the responses of each agent in the system. Get both right and the those responsible will be useful, relevant and tailored to the organisation the system is running in. Get them wrong and the results will be disappointing.
Keeping the whole system as model agnostic as possible allows for quickly switching models to take advantage of the fast pace of AI development as new models become available.
In my next post I’ll wrap everything up and briefly go over a few things worth discussing that didn’t fit neatly into any of the other posts.



