Industry Talk #2: What It Takes to Deliver on Agentic AI for Investment Research at Deutsche Bank
In this talk, Wai will explore how banks are rethinking the use of AI beyond traditional quantitative models, focusing on a shift toward agentic and intelligence‑driven architectures. The session explains the importance of preparing the technical infrastructure for RAG development, which ultimately speeds up research processes, improves governance, and reduces operational risk. Rather than simply adding AI tools, the talk highlights how banks are building structured “AI cookbooks,” centralized knowledge bases, and evaluation frameworks that allow humans and AI agents to collaborate at scale—evolving toward faster, more adaptive, and more creative research, often described as “vibe research,” that is set to redefine innovation in financial institutions.
Summary
Deutsche Bank: Infrastructure First, AI Second — Building Reliable Agentic Research
Speaker: Wai-Chung Ip (Caio), Senior Quantitative Developer, QIS Research, Deutsche Bank
Date: March 12, 2026
Event: Paris — Market Data x AI (Finteda / FactSet)
Team & Context
The QIS (Quantitative Investment Solutions) research team at Deutsche Bank develops systematic investment strategies and proprietary indices for institutional clients. The team spans London and India with nine members plus Caio. He covers artificial intelligence and data science efforts.
The Research Process Challenge
A typical QIS research process — idea generation, literature review, data processing, backtesting, visualization, reporting — can last up to six months or longer. AI aims to compress this timeline, but the challenges are significant:
- Data comes from different sources in unstructured formats
- Alternative data onboarding is not straightforward and is difficult to govern
- Researchers don't always know where to find the right data
- Code is often difficult to maintain
Over time, these challenges accumulate into technical debt — or as Caio jokingly calls it, a "cable salad": a tangled mess of code, data, and workflows that are difficult to work with.
Core principle: If humans struggle to work or reason about this environment, AI will have no chance either. Throwing an LLM into a cable salad won't give you the best answer for every research question — it just gives you a faster way to get the wrong answer.
Infrastructure First: Three Non-Negotiables
"What would the research stack need to look like if a machine needs to reason about it?"
1. Data must be discoverable and well described
2. Code must be maintainable, interpretable, and reusable
3. Knowledge must live in structured, controlled, monitored, and evaluated systems (e.g., agentic RAG)
Solving the Code Problem: Research SDK
Research code is exploratory — researchers build scripts in inconsistent formats, leading to:
- Duplicated functions across projects
- Slightly different implementations of the same calculation
- Backtests that can't be reproduced later
- Projects locked in personal notebooks (key-person risk)
Solution: Research SDK (Strategy Development Kit) — a unified library combining all code across strategies and utilities.
Benefits:
- Enables collaboration within and across teams (e.g., structuring/index teams can replicate strategies with zero discrepancy)
- Ensures consistent research quality standards across the QIS business
Backtesting Tools: Pipelines, Dashboards, and Inventory
Instead of building custom scripts, researchers follow a pipeline structure — an object that encapsulates configuration, data, and executable code. Pipelines are:
- Serializable, storable, and transferable
- Paired with dashboards
- Stored in an Inventory (a catalog of pipelines)
This abstraction also serves AI deployment: the Inventory becomes an arsenal of tools that agents can execute directly, eliminating the need to build separate MCP server functions for every capability.
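As a rough illustration of the pipeline abstraction and the Inventory, the sketch below uses invented names (`Pipeline`, `Inventory`, the momentum example); it is not Deutsche Bank's actual SDK:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Pipeline:
    """A self-contained unit: configuration, data references, and executable code."""
    name: str
    config: dict[str, Any]
    data_sources: list[str]
    run: Callable[[dict[str, Any]], Any]

class Inventory:
    """A catalog of published pipelines that humans or agents can execute directly."""
    def __init__(self) -> None:
        self._pipelines: dict[str, Pipeline] = {}

    def publish(self, p: Pipeline) -> None:
        self._pipelines[p.name] = p

    def execute(self, name: str, **overrides: Any) -> Any:
        p = self._pipelines[name]
        cfg = {**p.config, **overrides}  # per-run configuration overrides
        return p.run(cfg)

def momentum_backtest(cfg: dict[str, Any]) -> str:
    # Placeholder for an actual backtest computation.
    return f"backtest {cfg['universe']} lookback={cfg['lookback']}"

inventory = Inventory()
inventory.publish(Pipeline("momentum", {"universe": "global_equity", "lookback": 12},
                           ["prices_daily"], momentum_backtest))
print(inventory.execute("momentum", lookback=6))  # → backtest global_equity lookback=6
```

Because each pipeline carries its own configuration and data references, an agent only needs a name and an override dict to run it, which is what makes a catalog like this usable as a tool arsenal.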
Solving the Data Problem: Three Pillars
1. Centralized database — Unified access point for data and data onboarding
2. Alternative data onboarding via NelData — A third-party data catalog and scouting service that bridges data users and vendors. Provides access to ~8,000 datasets, customized scouting, evaluated data quality, due diligence, and analysis reports.
3. Data cataloging — Stores metadata (not data): what the data means, how it was collected, coverage, history, quality indicators, access permissions.
Why data catalogs matter for AI: AI cannot interpret raw tables, but it can reason over metadata. Instead of "search across all schemas and guess," you can ask: "Find the dataset with daily frequency, global equity coverage, history longer than 10 years." That's the difference between hallucination and retrieval.
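A toy sketch of that structured metadata query; the catalog entries and field names below are invented for illustration, not the bank's actual schema:

```python
# Hypothetical catalog entries: metadata only, no raw data.
CATALOG = [
    {"name": "global_equity_px", "frequency": "daily",
     "coverage": "global_equity", "history_years": 25, "quality": "gold"},
    {"name": "us_credit_spreads", "frequency": "weekly",
     "coverage": "us_credit", "history_years": 15, "quality": "silver"},
    {"name": "esg_sentiment", "frequency": "daily",
     "coverage": "global_equity", "history_years": 4, "quality": "bronze"},
]

def find_datasets(frequency=None, coverage=None, min_history_years=0):
    """Answer structured queries like: daily frequency, global equity, >10y history."""
    return [d["name"] for d in CATALOG
            if (frequency is None or d["frequency"] == frequency)
            and (coverage is None or d["coverage"] == coverage)
            and d["history_years"] >= min_history_years]

print(find_datasets(frequency="daily", coverage="global_equity", min_history_years=10))
# → ['global_equity_px']
```

The point is that the filter runs over well-described metadata fields, so the agent retrieves a grounded answer instead of guessing across raw schemas.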
Agentic RAG Architecture: Three Layers
1. Knowledge Provisioning
Start here — garbage in, garbage out. All resources must be connected to standardized interfaces (MCP servers) with consistent tool definitions.

2. Agentic Interaction Layer
Every agent is equipped with skills and wrapped in MCP clients. The key challenge: MCP burns a lot of tokens if used naively (agents consume the full schema description for every function call).

Solution: Progressive disclosure via skills. Skills tell the agent: "If someone wants data from a certain database, here's where to find the right function." The agent drills down to only the relevant tools, minimizing token consumption.
The pipeline Inventory further reduces overhead — researchers build executable pipelines rather than boilerplate MCP server functions.
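A minimal sketch of progressive disclosure; the skill names, tool registry, and routing function are hypothetical stand-ins, not the actual skills protocol:

```python
# Hypothetical tool registry: full schemas are expensive to keep in context.
TOOLS = {
    "query_prices": {"skill": "market_data", "schema": "<long JSON schema placeholder>"},
    "query_fundamentals": {"skill": "market_data", "schema": "<long JSON schema placeholder>"},
    "run_backtest": {"skill": "backtesting", "schema": "<long JSON schema placeholder>"},
}

# Skills: one-line signposts the agent always sees (cheap in tokens).
SKILLS = {
    "market_data": "Use for retrieving price or fundamental data from the database.",
    "backtesting": "Use for executing pipelines from the Inventory.",
}

def tools_for_skill(skill: str) -> dict:
    """Progressive disclosure: load full schemas only for the chosen skill."""
    return {name: t["schema"] for name, t in TOOLS.items() if t["skill"] == skill}

# The agent first reads SKILLS (a few tokens), decides on "market_data",
# and only then receives the two relevant full schemas.
print(sorted(tools_for_skill("market_data")))
# → ['query_fundamentals', 'query_prices']
```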
3. Main Workflow
Implements RAG techniques with an iterative approach:
- Action planning → Execution → Evaluation
- Query transformation (break long/short queries into sub-queries, extrapolate context)
- Search across document summaries first, then drill into subsets
- Context engineering (vs. prompt engineering): everything the model knows before answering — system instructions, conversation history, retrieved documents, examples, and crucially, constraints (what NOT to do)
- Evaluation, monitoring, and testing at every component level: task output accuracy, retrieval precision, LLM decision-making, human-in-the-loop feedback, A/B testing

The Vision: Vibe-Researching
Instead of weeks of reading literature and months of coding, researchers can "broadcast" to agents that handle data identification and information retrieval — freeing humans to focus on idea generation and decision-making.
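The iterative plan, execute, evaluate workflow behind this can be sketched as follows; `plan`, `execute`, and `evaluate` here are hypothetical stand-ins for LLM calls and Inventory pipeline runs, not interfaces from the talk:

```python
def run_research_task(query, plan, execute, evaluate, max_iters=3):
    """Iterate until the evaluator accepts the output or the budget is spent."""
    context = {"query": query, "attempts": []}
    for _ in range(max_iters):
        action = plan(context)          # e.g. pick a pipeline and parameters
        result = execute(action)        # e.g. run it from the Inventory
        verdict = evaluate(result)      # e.g. retrieval precision, sanity checks
        context["attempts"].append((action, result, verdict))
        if verdict["ok"]:
            return result
    return None  # escalate to a human in the loop

# Toy usage: the evaluator accepts only on the second attempt.
calls = {"n": 0}
def plan(ctx):
    return {"pipeline": "momentum", "lookback": 12 - 6 * len(ctx["attempts"])}
def execute(action):
    return action["lookback"]
def evaluate(result):
    calls["n"] += 1
    return {"ok": calls["n"] >= 2}

print(run_research_task("find momentum signal", plan, execute, evaluate))  # → 6
```

Keeping every attempt in `context` is what lets the planner revise its next action, and the `None` path is where human-in-the-loop review takes over.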
"In the long run, the teams that will lead in the AI race are not those with the fanciest model, but those with the most reliable infrastructure that allows the agent to perform in a consistent and sustainable way."
Q&A:
Q: Buy vs. build — how did you make that decision? A: We don't build LLM models ourselves, but we build most other components in-house, leveraging frameworks like LangChain and LangGraph. We are onboarding a lot of external resources to help with some processes and looking to incorporate those solutions into our infrastructure.
Q (from Laurent Fabre, Databricks): What kind of data platform are you using? Self-hosted or third-party? A: We have an in-house solution called DataStore — orchestrates data from different sources automatically on a daily basis via Temporal, and serves as a RESTful API web service for easy usage.
Q: How do you protect data privacy when using commercial LLMs? A: There are a couple of LLM services operating inside Deutsche Bank. Data can go to GCP or on-premise depending on the confidentiality flag in the metadata. If messages are flagged as confidential, processing stays on-premise instead of going to GCP.
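As a sketch of what such metadata-driven routing could look like (the endpoint URLs, flag name, and fail-closed default are assumptions for illustration, not Deutsche Bank's actual implementation):

```python
# Placeholder endpoints; real deployments would use internal service URLs.
ON_PREM_ENDPOINT = "https://llm.internal.example/v1"
CLOUD_ENDPOINT = "https://llm.gcp.example/v1"

def route_request(message: str, metadata: dict) -> str:
    """Pick the serving endpoint based on the confidentiality flag in the metadata."""
    if metadata.get("confidential", True):  # fail closed: unknown defaults to on-prem
        return ON_PREM_ENDPOINT
    return CLOUD_ENDPOINT

print(route_request("summarize this term sheet", {"confidential": True}))
# → https://llm.internal.example/v1
```

Defaulting to on-premise when the flag is missing is a deliberate fail-closed choice in this sketch, so an unlabeled message can never leak to the cloud path.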
Q: Are you only using LLMs for data retrieval, or also for designing new QIS strategies? A: The whole process is strategy-agnostic. The focus is on the process, not specific strategy types. For example, the LLM can scan thousands of scientific papers for equity strategy ideas or any other domain. The missing component is "vibe-coding" — combining idea generation and code development end-to-end.
Full Transcript
[TALK]
Caio: In this session we're going to talk about AI, but more importantly, about something beyond AI as well. We are the QIS research team from Deutsche Bank. QIS stands for Quantitative Investment Solutions, and we develop systematic investment strategies and proprietary indices for institutional clients. Our research team spans London and India and consists of nine members plus myself. My name is Wai-Chung Ip, and I'm a quantitative developer on the team, covering artificial intelligence and data science efforts.
A typical research process consists of multiple key components — from idea generation, literature review, data processing, backtesting, visualization, and reporting, and so on. The whole process can be lengthy: it could last six months or even longer. By using AI, we aim to compress the whole process into a shorter period of time.
However, things are a bit more challenging than you might imagine. For example, we could have data that comes from different sources in an unstructured fashion. While alternative data is increasingly important, the onboarding process is not that straightforward, and its usage can be difficult to govern. Researchers don't necessarily know where to find the right data, and they may write code that is difficult to maintain as well. Over time, all these challenges accumulate and build up into something called technical debt, or, as we jokingly name it, a "cable salad": a tangled mess of code, data, and workflows that is difficult to work with.
And now here is the key observation. If humans struggle to work or reason about this environment, AI will have no chance either. Throwing a large language model into a cable salad won't give you the best answer for every research question — it just gives you a faster way to get the wrong answer.
And the core principle here is: we focus on infrastructure first, AI second. We start by asking the question — what would the research stack need to look like if a machine needs to reason about it? And it immediately narrows down to three non-negotiables. First, the data must be discoverable and well described. Second, the code must be maintainable, interpretable, and reusable. And third, we should have knowledge that lives in a structured, controlled, monitored, and evaluated system — such as agentic RAG.
Let's talk about the code first. Research code is exploratory, meaning that researchers build a lot of scripts that are not in a consistent format. And a lot of symptoms evolve because of this, such as duplicated functions across projects, or slightly different implementations of the same calculation, backtesting that couldn't be reproduced later, or projects that are locked in a personal notebook — which means there's centralized key-person risk.
And to solve this kind of problem, we decided to consolidate all development into a single Research SDK, or Strategy Development Kit. This is a library we developed to combine all the code across strategies and utilities. With this, it helps not only the collaboration within the team, but also collaboration across teams as well. Imagine: we, the research team, develop strategies, while the structuring and index teams need to understand what we have done or replicate our strategies as well. With this single unified access point, they would be able to understand, interpret, and replicate code with zero discrepancy. And this ensures that the research quality is up to the same standard across the whole QIS business.
On the other hand, we also built some backtesting tools, and this is quite relevant to AI development — I'll explain more later. Some examples of tools include pipelines, dashboards, and inventory. You can imagine that instead of building their own scripts, researchers can build and follow the pipeline structure, which is literally an object that encapsulates configuration, data, and executable code. This pipeline also comes with dashboards and the whole thing is serializable, storable, and transferable.
So if someone wants to use our code during backtesting, instead of installing the package again or requiring me to update the package and distribute one by one, we could literally ask researchers to create a pipeline, upload it to a server — or something called the Inventory, like a catalog of pipelines — and then end users go to the Inventory, download the code they want, do the backtest, get the results they want, and perform execution on a local environment. And this is something also related to AI deployment as well.
On the other hand, on the data side, there are three pillars. Centralized database — this is very important because it serves as a central unified access point for data, data onboarding, and data cataloging.
Alternative data is getting more and more important because it encompasses different kinds of alpha that traditional data won't be able to offer. However, onboarding alternative data is not that straightforward. It comes with a lot of risk. It could come from different vendors, and it would be difficult to assess the procurement or to know whether the data collection process is transparent enough for us to use it with confidence.
So we decided to partner with a third-party vendor called NelData. NelData is a data catalog and data scouting service provider that bridges the gap between the data user and the data vendor. On its platform, we can search through about 8,000 datasets and we could ask for customized data scouting, or they can provide evaluated data quality, due diligence, and analysis reports as well. This basically saves our time in doing the work to collect different vendors and assess the risk one by one.
On the other hand, once we get the data and put it in a centralized database, then we should think about doing data cataloging. A data catalog doesn't store data, but it stores metadata. It describes what the data means, how it was collected, coverage, history, quality indicators, access permissions, and so forth.
And why does it matter for AI? It's because AI cannot interpret raw tables, but it can reason over metadata. Instead of asking, "Hey, can you search across all schemas and guess what's relevant?" — we can ask the agent, "Find the dataset with daily frequency, global equity coverage, history longer than 10 years," for example. That's the difference between hallucination and retrieval.
And the last bit: let's talk about agentic RAG. The architecture is a bit involved, so let me try to break it down.
In our agentic RAG development, we separate the architecture into three layers. The first one is knowledge provisioning, and we always suggest that you should start from knowledge provisioning, because everyone knows that garbage in, garbage out, right? If we throw unorganized input into an LLM, it's not going to understand whether the inputs are garbage or not. On the other hand, all the resources should be connected to a standardized interface — like an MCP server — with tool definitions that are consistent and available for agents to interact with in a standardized fashion.
Then we have the agentic interaction layer. Every agent is equipped with skills and also wrapped in MCP clients to interact with the MCP server. They work in a collaborative fashion. But the main issue of MCP is that it burns a lot of tokens. Imagine if you just throw the MCP tools directly without any guidance. Then for every single call, the agent needs to consume the MCP description schema for every single function. However, with skills — and a proper skills protocol similar to progressive disclosure — the skills will signal to the agent: "Hey, if someone wants to get data from a certain database, here's where to find the right function." So instead of looking through all the MCP tools, it will find the relevant ones and drill down to the relevant tool, minimizing token consumption as well.
And going back to the Inventory backtesting tool we just talked about — it basically abstracts away the necessity to build functioning tools for every single function as an MCP server endpoint, which means it saves a lot of boilerplate. Instead of asking researchers to build code inside MCP, they could build a pipeline easily executable by the agent to perform a single function call. The Inventory, in this sense, literally becomes an arsenal for doing research.
And finally we have the main workflow. The main workflow implements RAG techniques — parse the query, find the information, and refine information using an iterative approach. We perform action planning, execution, evaluation, and ensure that the outputs are properly evaluated. We make sure the output meets our standards before generating the final output.
Everyone knows that generative AI can be problematic because the outputs are probabilistic — for example, hallucination. It could give inconsistent answers or make unexpected tool calls. Imagine you need to perform a chain of tool calls and there is a small chance of risk that it will perform an incorrect call. This risk will accumulate and multiply across the execution chain.
How do we solve it? How should you think about the solution? The first one is RAG techniques, for example query transformation. User queries can be too long or too short, and information can be inaccurate. We can use a large language model to break the queries into sub-queries or extrapolate context before sending it to the next step. Instead of searching through paragraphs in each document, we can ask the model to search across the summaries first, extract the subset of documents, and scan inside the subset instead. This makes it more accurate, but it requires more heavily engineered infrastructure.
The second one is context engineering. Compared to prompt engineering, which is about what we ask, context engineering is about everything the model knows before and while answering. The context can be the system instructions, the user prompt, conversation history, retrieved documents, examples, data, and more importantly, constraints — to tell the agent what not to do. It matters because LLMs don't truly remember facts outside the context window. They are sensitive to wording, order, and relevance. So for example, a message to an LLM could be filtered and structured with persona, examples, and documents. And all of these are going to tell it what to do and, more importantly, what not to do as well.
And the final bit is evaluation, monitoring, and testing. Given the output is probabilistic, in order to be reliable, we need to implement proper evaluation, monitoring, and testing across every single component. At the lowest level, for instance, we can evaluate and monitor the task output, outcome accuracy, retrieval precision, and how the LLM makes decisions as well. Always consider the LLM as something that needs human-in-the-loop guidance to steer the process, collect user feedback, and perform testing — A/B testing.
With all these efforts we can have a solid and reliable infrastructure that allows us to perform something called "vibe-researching." Instead of letting researchers go through weeks of reading documents and literature, and then another few months of coding, they can literally broadcast to the agent, which performs the data identification and finds the right information, while they focus on idea generation and decision-making.
In the long run, the teams that are going to lead in the race of AI are not those with the fanciest model, but those with the most reliable infrastructure that allows the agent to perform in a consistent and sustainable way. Thank you.
[Q&A]
Moderator: Perfect timing. We have one or two questions.
Caio: Sure.
Audience Member: Thank you very much. Very insightful. My question is about this particular setup. When you designed it, how did you make — or based on which criteria did you make — the buy versus build decision? Is it all fully built in-house?
Caio: We don't build LLM models ourselves, but for the other components we build them ourselves. We leverage frameworks like LangChain and LangGraph to build things. But things are changing, so we are onboarding a lot of external resources to help with some of the process, and we're looking forward to incorporating those solutions into our infrastructure.
Moderator: Thank you.
Laurent Fabre (Databricks): Very interesting presentation. Thank you so much. My main question would be: what kind of data platform are you looking at? Are you hosting it yourself or do you go through a third party? What does it look like?
Caio: When you say data platform, I assume you're talking about how we host the data, how we provision it. We have an in-house solution. We built a solution called DataStore. DataStore orchestrates data from different sources automatically on a daily basis via Temporal, and it serves as a web service — a RESTful API — for easy usage.
Laurent Fabre: Okay, thank you so much.
Moderator: Okay, one more question.
Audience Member: You mentioned you use commercial LLMs. How do you protect data privacy? And when you build a backtesting platform for financial data, how do you handle that when using LLMs?
Caio: I'll try to answer to the best of my knowledge. There are a couple of LLM services that operate inside Deutsche Bank, and the one we use — based on the IT solution — the data could go to GCP or go on-premise. It depends on, when you interact with the LLM, what confidentiality flag you are adding into the metadata. If you specify your messages as confidential, it goes on-premise instead of GCP. So I hope this answers your question.
Moderator: Okay, good. One last question.
Audience Member: Hello, thank you for the talk. I was wondering — what kind of QIS strategies are you working on? Are you only using LLMs for data retrieval, or are you going one step further and trying to design new QIS strategies?
Caio: The whole process is strategy-agnostic. What we focus on is the process, not a specific type of strategy. If someone wants to look into equity strategies, we can ask the LLM, "Hey, can you go through thousands of scientific papers and get some ideas?" Similarly, if someone wants to build other types of strategies, they can do the same thing as well. There is actually one missing component here, which is "vibe-coding" — and with that kind of implementation, we can literally combine the idea generation and code development process to make sure the whole process is complete from end to end.
Moderator: Okay, thank you.