This post is the fourth in a series about how to implement legal AI that knows your law firm. In the series, we cover the differences between LLMs and search, the elements that make a good search engine, the building blocks of agentic systems (e.g. RAG), and how to implement a system that is fully scalable and secure and that respects your firm’s unique policies and practices.
By the time a law firm reaches more than a few dozen lawyers, the need for “knowledge management” becomes clear. Institutional knowledge includes valuable prior work interpreting points of law, court or transactional procedures, or simply a clause or paragraph that was expertly crafted to deal with a particular client situation. All of these needs recur. A good approach to knowledge management provides legal professionals with the best, right prior knowledge just when and where they need it.
To do this, a good retrieval system needs more than a blend of keyword and vector techniques (as described in the previous post); it must also be constantly informed by the entire context of the firm. What if the precedent, memo, or checklist you need is sitting in a client matter handled by another partner, one you’re not even aware worked on that topic? What if it is tucked away in their email? Or in some intranet repository?
Without a good retrieval system, a firm has no choice but to rely on “pardon the interruption” emails and labor-intensive curation efforts. So-called PTI emails draw inefficient and inconsistent responses. Curation efforts, which can be valuable for identifying well-respected and vetted precedents, are burdensome to conduct and can’t alone serve up answers to the full range of real-life scenarios legal professionals encounter in client work.
An essential aspect of search is its ability to tame the fragmented and sometimes “messy” data found across a law firm’s systems (email, shared drives, and document management systems) and to understand each document’s content and context. It needs to recognize that certain documents are related to each other: several drafts and versions of a given document might all be responsive to a search, but they should be treated as related rather than as distinct content. Similarly, the search system can group documents that are related along some other dimension, such as being authored by the same person or belonging to the same company or transaction. If expert-curated collections exist, it can take those into account to improve results; but a good search doesn’t depend on the pre-existence of curation. Rather, it can help identify the best precedents in real time.
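To make the idea of version grouping concrete, here is a minimal sketch of how a retrieval layer might cluster drafts of the same document before presenting results. The heuristic (matching on matter number plus a normalized title, with a hypothetical “v2” suffix) is purely illustrative; production systems typically rely on content fingerprints and version metadata from the source system.

```python
from collections import defaultdict

def group_versions(hits):
    """Group search hits that appear to be drafts of the same document.

    Illustrative heuristic only: hits sharing a matter number and a
    normalized title form one family, best-scoring version first.
    """
    families = defaultdict(list)
    for hit in hits:
        key = (hit["matter"], hit["title"].lower().replace(" v2", "").strip())
        families[key].append(hit)
    # Within each family, surface the most relevant version first.
    return [sorted(f, key=lambda h: h["score"], reverse=True)
            for f in families.values()]

hits = [
    {"title": "Indemnity Clause", "matter": "1001", "score": 0.90},
    {"title": "Indemnity Clause v2", "matter": "1001", "score": 0.95},
    {"title": "Escrow Checklist", "matter": "2002", "score": 0.70},
]
grouped = group_versions(hits)  # two families instead of three flat hits
```

The user sees one entry per document family rather than near-duplicate rows, which is the behavior described above.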
In addition to retrieving data from fragmented and disparate data sources, a crucial function of search is to identify the connections between documents, entities, and data points. That’s where metadata comes in—data about the data. Metadata can identify the document's author, the matter to which it belongs, the industry associated with the matter, the outcome of the matter it is connected to, and so on. Search can also identify metadata useful for sorting and filtering documents, such as which users in the system have accessed and used a document, as well as how often it has been accessed. A strong enterprise search system will also be able to categorize documents (both automatically and by ingesting the categorizations that already exist within a firm’s systems), allowing search results to be filtered by key attributes of the document itself. All of this metadata is crucial in helping a search system assess the importance of documents and provide the critical context that gives meaning to the data.
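The sorting and filtering role of metadata can be sketched in a few lines. The field names and values below are hypothetical, standing in for whatever attributes a firm actually tracks (author, matter, industry, usage counts, and so on):

```python
from dataclasses import dataclass

@dataclass
class DocMeta:
    author: str
    matter: str
    industry: str
    access_count: int  # how often users have opened the document

docs = [
    DocMeta("A. Partner", "M-100", "healthcare", 42),
    DocMeta("B. Associate", "M-200", "energy", 3),
    DocMeta("C. Partner", "M-300", "healthcare", 17),
]

# Filter by a document attribute, then use usage as a relevance signal.
healthcare = [d for d in docs if d.industry == "healthcare"]
ranked = sorted(healthcare, key=lambda d: d.access_count, reverse=True)
```

Frequently used documents in the relevant industry rise to the top, which is one simple way metadata "gives meaning to the data."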
An interesting quirk of the disparate sources of law firm data (document management systems, intranet, email, and other tools) is that each source tool has its own native search system. While it can be tempting to rely on the source tool’s search, the reality is that few of these tools are focused on providing excellent retrieval. Moreover, each implements search in a different way. It is the job of a good enterprise search system not just to deal with this complexity but to find the most relevant content regardless of the limitations inherent in the source systems.
Enterprise search works well because it creates and continuously maintains a single, unified index of all of your firm’s data—regardless of where it resides—and enriches it with consistent metadata, security permissions, and relevance scoring.
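As a toy illustration of the unified-index idea, the sketch below normalizes documents from several hypothetical sources into one schema, carrying security permissions alongside each record so they can be enforced at query time. Real enterprise search engines use inverted indexes and vector stores rather than a Python list; this only shows the architecture.

```python
# One index spanning all sources, rather than one search per source.
index = []

def ingest(source, doc_id, text, allowed_groups):
    """Normalize a document from any source system into the shared schema."""
    index.append({
        "source": source,            # e.g. "dms", "email", "intranet"
        "id": doc_id,
        "text": text.lower(),
        "acl": set(allowed_groups),  # permissions travel with the document
    })

def search(query, user_groups):
    """Match across every source at once, honoring permissions."""
    q = query.lower()
    return [d for d in index
            if q in d["text"] and d["acl"] & set(user_groups)]

ingest("dms", "doc-1", "Escrow agreement precedent", {"corporate"})
ingest("email", "msg-9", "Notes on escrow mechanics", {"corporate", "litigation"})
ingest("intranet", "page-3", "Holiday schedule", {"all-staff"})

results = search("escrow", {"corporate"})  # hits from both DMS and email
```

Because everything lives in one index with consistent metadata and ACLs, a single query reaches content regardless of where it originally resided.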
Because building a high-quality enterprise search engine that can scale to billions of documents is hard work, some vendors employ a technique called federated search. Federated search is a method that sends your query out to each system separately, relying on each system’s built-in search capabilities to retrieve results in real time, which are then collected and presented together without any unified indexing or consistent ranking.
Federated search is fundamentally limited because it merely sends your query out to each individual system in parallel—relying on each system’s native search capabilities, indexing rules, and response times. This approach fragments both the process and the results. Instead of a cohesive, unified picture of your institutional knowledge, you receive a collection of partial answers, each shaped by the constraints of the system it came from. If one repository has weak search or incomplete metadata, its results will be shallow or irrelevant, regardless of the strength of the overall query. Moreover, federated search cannot reliably deduplicate or reconcile overlapping documents, such as multiple drafts or versions saved in different locations, nor can it provide consistent ranking across systems. The burden of interpreting and stitching together the fragments is left to the user—an unacceptable inefficiency when dealing with high-stakes, time-sensitive legal work.
In addition, federated search cannot achieve the speed, enrichment, or contextual awareness of enterprise search. Because no unified index exists, every search requires live queries to each system, resulting in unpredictable delays and variability in performance. Critical AI capabilities—such as automatically classifying documents, linking related records across systems, and extracting entities for contextual filtering—are impractical without a central index to operate on. This means that federated search will always lack the deep connective tissue that transforms isolated data into actionable knowledge. For a modern law firm, where the value of work often depends on quickly understanding how disparate documents and facts fit together, federated search simply cannot deliver the integrated, intelligent retrieval experience that true enterprise search provides.
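The ranking problem described above is easy to demonstrate. In this hypothetical sketch, two source systems score results on different scales, so naively merging their raw scores puts a mediocre hit above the best one. The source names and score ranges are invented for illustration:

```python
def dms_search(query):
    # Hypothetical source A: relevance scores normalized to 0..1.
    return [("doc-1", 0.92), ("doc-2", 0.40)]

def intranet_search(query):
    # Hypothetical source B: relevance scores on a 0..100 scale.
    return [("page-7", 35.0)]

def federated(query):
    """Fan the query out and merge raw results, as federated search does."""
    merged = dms_search(query) + intranet_search(query)
    # Sorting raw scores ranks a weak intranet hit above the strongest
    # DMS document, because the two scales are not comparable.
    return sorted(merged, key=lambda r: r[1], reverse=True)

results = federated("indemnity")  # "page-7" wrongly lands on top
```

A unified index avoids this by scoring every document with one consistent relevance model, so rankings are meaningful across sources.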
The fragmented systems and diverse data types inherent in law firm data are likely to persist as the volume of work product continues to increase. A firm can’t “federate” or “curate” its way out of that problem. With the right search system, however, your firm can work with the mess, not around it—and the search system can provide the LLM with the critical data it needs to weave the firm’s expertise into its generative responses.
Explore the blog series “Legal AI That Knows Your Firm”
Posts in this series:
- The Allure (and Danger) of Using Standalone LLMs for Search
- Why Retrieval Augmented Generation (RAG) Matters
- All Search Engines Are Not Created Equal
- Why good legal search is informed by the entire context of your institutional knowledge—not siloed or “federated” (this post)
- Coming soon
This post was adapted from our forthcoming 24-page white paper entitled "Implementing AI That Knows Your Firm: A Practical Guide." Sign up for our email list to be notified when the guide is available for download.