At Comulate, we're transforming the insurance industry by automating the reconciliation of billions (and ultimately trillions!) of dollars. At the heart of this effort is our product that lets customers pay invoices on autopilot, which requires extracting structured data from invoices. This post explores a unique approach to the problem: a Retrieval-Augmented Generation (RAG) system that self-learns and delivers significant accuracy improvements over traditional pure-LLM approaches.
After early prototyping and research, we found the most difficult problems were related to ambiguity and context. For example, try determining the amount due on the invoice below.
In a sea of information, there’s both an ‘Amount Invoiced’ and an ‘Invoice Amount’: which is correct? In this case, the deciding context (e.g. “this sender always uses ‘Invoice Amount’”) is difficult to discern without prior experience. Looking into our dataset, we found that difficult cases like this weren’t exceptions at all: they were the norm.
A key enabler was drawn from our prior work in document understanding: the format of a document is valuable context that unlocks better accuracy. Once you understand how to deal with a certain ‘type’ of invoice, the work becomes more routine. In the example above, the decision between ‘Amount Invoiced’ and ‘Invoice Amount’ should always be the same for a given format. But what exactly is a format, and how do you determine it for a given document?
We started by evaluating a few different options for a format identifier, all of which had significant tradeoffs.
Semantic Vectors
We hypothesized that forming document-level text embeddings and clustering them using an ML algorithm would be an effective strategy. To evaluate performance, however, we’d need ground truth data on the clustering, which would introduce additional complexity and cost. This, along with the additional infrastructure needed to implement this approach, led us to decide against it.
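To make the idea concrete, here is a minimal sketch of embedding-and-cluster format grouping. The `embed` function is a toy bag-of-words hashing embedding standing in for a real embedding model (the post doesn't specify one), and the greedy threshold clustering is just one simple choice of algorithm:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashing embedding -- a stand-in for a real embedding model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cluster(docs: list[str], threshold: float = 0.7) -> list[int]:
    """Greedy single-pass clustering: assign each document to the first
    cluster whose seed vector is within the similarity threshold."""
    centroids: list[list[float]] = []
    labels: list[int] = []
    for doc in docs:
        v = embed(doc)
        best = None
        best_sim = threshold
        for i, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(v)
            labels.append(len(centroids) - 1)
        else:
            labels.append(best)
    return labels
```

Even in this toy form, the tradeoff the post describes is visible: the `threshold` (and any real clustering hyperparameters) would need ground truth cluster labels to tune.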
Visual Identifiers
Unique logos or relative keyword positions could be used to identify similar documents. However, we found that many documents were produced using the same software and thus shared visual identifiers, despite having subtle differences in format. Noise in the data also made this approach brittle: detecting logos and relative keyword positions has inherent complexity.
After poring over the data, we realized a hidden-gem identifier was hiding in plain sight. All invoices need to be paid, and the vast majority include instructions on where to send the money in the form of a bank account and routing number. Together, these items form a unique remittance pair.
Since ABA routing numbers follow a checksum, we can deterministically extract and verify remittance pairs without an ML approach. This turned out to be the most reliable format identifier we evaluated.
Self-learning RAG Pipeline
Because customers ultimately pay all of their invoices, we’re given a connection between source documents and human-verified output data. This product usage builds a valuable, ever-growing ground truth corpus that we can leverage in evaluating and improving system performance.
We then began thinking about how to leverage our format grouping in our choice of models. Inspired by model distillation, we found that using large reasoning models to construct simplified directions for smaller models on a per-format basis was highly effective: it allowed smaller, more cost-effective models to outperform larger ones.
Leveraging our ground truth corpus and format identification, we designed a pipeline that identifies each document’s format via its remittance pair, distills per-format extraction directions with a large reasoning model, and applies those directions with a smaller model.
As we continue to accumulate ground truth data, the pipeline builds per-format understanding; the system naturally improves over time without any direct involvement!
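The routing logic described above can be sketched as follows. This is a minimal illustration under assumed interfaces: `large_model` and `small_model` stand in for calls to a reasoning model and a cheaper model, and the in-memory `instruction_store` stands in for a persistent per-format store; none of these names are from the post:

```python
# Hypothetical store mapping a format signature (e.g. a remittance pair)
# to distilled, format-specific extraction directions.
instruction_store: dict[str, str] = {}

def extract(doc_text: str, signature: str, large_model, small_model) -> str:
    """Route by format signature: known formats go to the small model with
    previously distilled directions; unseen formats fall back to the large
    reasoning model, which also writes directions for future documents."""
    directions = instruction_store.get(signature)
    if directions is not None:
        # Cheap path: small model guided by per-format directions.
        return small_model(f"{directions}\n\n{doc_text}")
    # Expensive path, paid once per format: extract with the large model...
    result = large_model(f"Extract the invoice fields:\n\n{doc_text}")
    # ...and distill simplified directions so later documents with the
    # same signature take the cheap path.
    instruction_store[signature] = large_model(
        f"Write concise extraction directions for invoices like:\n\n{doc_text}"
    )
    return result
```

Under this shape, the large model's cost is amortized across every future document sharing the format signature, which is where the self-learning behavior comes from: more verified usage means more formats on the cheap path.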
Performance
Using this strategy, we significantly improve accuracy while lowering the cost of inference by a factor of ~15.*
We believe that the best software happens when you deeply understand both the technology and the business domain. We’re applying this to the insurance back office, which has gone largely untouched by modern technology, making it one of the most exciting areas within fintech to apply AI. If this sounds interesting, we’re hiring engineers! Check out open roles here.
*Based on relative token use and cost of OpenAI GPT-4o, GPT-4o mini, and o1 as of Jan 2025. Cost of o1 inference is amortized across all uses for documents with matching format signatures.