At Comulate, we're transforming the insurance industry by automating the reconciliation of billions (and ultimately trillions!) of dollars. At the heart of this effort is our product that lets customers pay invoices on autopilot, which requires extracting structured data from those invoices. This post explores a unique approach to the problem: a Retrieval-Augmented Generation (RAG) system that self-learns and delivers significant accuracy improvements over pure LLM approaches.

Problem: Context & Ambiguity

After early prototyping and research, we found the most difficult problems were related to ambiguity and context. For example, try determining the amount due on the invoice below.

In a sea of information, there’s both an ‘Amount Invoiced’ and an ‘Invoice Amount’. Which is correct? In this case, the deciding context (e.g. “this sender always uses ‘Invoice Amount’”) is difficult to discern without prior experience. Looking into our dataset, we found that difficult cases like this weren’t exceptions at all: they were the norm.

Determining Format

A key enabler came from our prior work in document understanding: the format of a document is valuable context that unlocks better accuracy. Once you understand how to deal with a certain ‘type’ of invoice, the work becomes routine. In the example above, the decision between ‘Amount Invoiced’ and ‘Invoice Amount’ should always be the same for a given format. But what exactly is a format, and how do you determine it for a given document?

Format Identifier Approaches

We started by evaluating a few different options for a format identifier, each of which came with significant tradeoffs.

Semantic Vectors

We hypothesized that forming document-level text embeddings and clustering them using an ML algorithm would be an effective strategy. To evaluate performance, however, we’d need ground truth data on the clustering, which would introduce additional complexity and cost. This, along with the additional infrastructure needed to implement this approach, led us to decide against it.
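
For context, the approach we decided against would have looked roughly like the sketch below, with cluster labels standing in as format identifiers. The embedding model and clustering parameters here are illustrative, not a description of anything we built.

```python
# Illustrative only: document-level embeddings clustered into candidate formats.
from openai import OpenAI
from sklearn.cluster import DBSCAN

client = OpenAI()


def cluster_by_embedding(invoice_texts: list[str]) -> list[int]:
    """Embed each document and group similar ones; a cluster label would then
    serve as the format identifier (-1 means no cluster was found)."""
    embeddings = [
        item.embedding
        for item in client.embeddings.create(
            model="text-embedding-3-small", input=invoice_texts
        ).data
    ]
    # Density-based clustering avoids fixing the number of formats up front,
    # but judging cluster quality still requires labeled ground truth.
    return DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(embeddings).tolist()
```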

Visual Identifiers

Unique logos or relative keyword positions could be used to identify similar documents. However, we found that many documents were produced with the same software and thus shared visual identifiers despite subtle differences in format. Noise in the data also made this approach fragile: detecting logos and relative keyword positions has inherent complexity.

Our Solution: Remittance Pairs

After poring over the data, we found a hidden gem: an identifier hiding in plain sight. All invoices need to be paid, and the vast majority include instructions on where to send the money in the form of a bank account and routing number. Together, these form a unique remittance pair.

Since ABA routing numbers include a built-in checksum, we can deterministically extract and verify remittance pairs without any ML. This turned out to be the most reliable format identifier we evaluated.
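
To make this concrete, here is a minimal sketch of what deriving a format signature from a remittance pair could look like. The regexes, helper names, and use of a SHA-256 hash are illustrative simplifications rather than our production implementation.

```python
import hashlib
import re

ROUTING_RE = re.compile(r"\b\d{9}\b")
ACCOUNT_RE = re.compile(r"\b\d{6,17}\b")  # US account numbers vary widely in length


def is_valid_aba(routing: str) -> bool:
    """ABA routing numbers carry a built-in checksum:
    3*(d1+d4+d7) + 7*(d2+d5+d8) + (d3+d6+d9) must be divisible by 10."""
    if len(routing) != 9 or not routing.isdigit():
        return False
    d = [int(c) for c in routing]
    return (3 * (d[0] + d[3] + d[6]) + 7 * (d[1] + d[4] + d[7]) + (d[2] + d[5] + d[8])) % 10 == 0


def format_signature(invoice_text: str) -> str | None:
    """Return a stable signature for the first verified remittance pair found,
    or None if the document has no recognizable pair."""
    routings = [r for r in ROUTING_RE.findall(invoice_text) if is_valid_aba(r)]
    for routing in routings:
        for account in ACCOUNT_RE.findall(invoice_text):
            if account != routing:
                # Hash the pair so the signature can be stored and compared
                # without persisting raw bank details alongside prompts.
                return hashlib.sha256(f"{routing}:{account}".encode()).hexdigest()
    return None
```

Any two invoices that produce the same signature are treated as the same format downstream.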

Self-learning RAG Pipeline

Because customers ultimately pay all of their invoices, we’re given a connection between source documents and human-verified output data. This product usage builds a valuable, ever-growing ground truth corpus that we can leverage in evaluating and improving system performance.

We then considered how to leverage our format grouping in our choice of models. Inspired by model distillation, we found that using large reasoning models to construct simplified, per-format directions allowed smaller, more cost-effective models to outperform larger ones.
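
As a rough illustration, that distillation step could look something like the snippet below, using the OpenAI Python SDK. The model choice, prompt wording, and GroundTruthExample shape are simplifying assumptions rather than our exact setup.

```python
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()


@dataclass
class GroundTruthExample:
    invoice_text: str      # raw text extracted from the invoice
    verified_fields: dict  # human-verified structured output (e.g. amount due)


def distill_format_instructions(examples: list[GroundTruthExample]) -> str:
    """Ask a large reasoning model to write concise, format-specific extraction
    directions that a smaller model can follow on future documents."""
    examples_block = "\n\n".join(
        f"INVOICE TEXT:\n{ex.invoice_text}\n\nVERIFIED OUTPUT:\n{ex.verified_fields}"
        for ex in examples
    )
    response = client.chat.completions.create(
        model="o1",
        messages=[{
            "role": "user",
            "content": (
                "These invoices all share one format. Study the verified outputs and "
                "write short, unambiguous instructions telling a smaller model exactly "
                "which labels and fields to use when extracting structured data from "
                "invoices of this format.\n\n" + examples_block
            ),
        }],
    )
    return response.choices[0].message.content
```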

Leveraging our ground truth corpus and format identification, we designed a pipeline that:

  1. Aggregates the ground truth corpus, partitioned by format
  2. Feeds the ground truth inputs and outputs for a given format to a reasoning model
  3. Asks the reasoning model to construct a per-format prompt that reproduces the verified outputs from the inputs
  4. Stores the returned prompt for the given format
  5. Applies this prompt to all future extractions for this format

As we continue to accumulate ground truth data, the pipeline builds per-format understanding; the system naturally improves over time without any manual intervention!
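
Condensed into code, the pipeline could look roughly like this, reusing the hypothetical format_signature, GroundTruthExample, and distill_format_instructions helpers sketched above, with prompt_store standing in for a persistent store keyed by format signature.

```python
from collections import defaultdict

from openai import OpenAI

client = OpenAI()

# format signature -> distilled per-format prompt (a real system would persist this)
prompt_store: dict[str, str] = {}

GENERIC_PROMPT = "Extract the amount due, invoice number, and due date as JSON."


def refresh_prompt_store(corpus: list[GroundTruthExample]) -> None:
    """Steps 1-4: partition the ground truth corpus by format signature, have the
    reasoning model distill a prompt per format, and store the result."""
    by_format: dict[str, list[GroundTruthExample]] = defaultdict(list)
    for example in corpus:
        signature = format_signature(example.invoice_text)
        if signature:
            by_format[signature].append(example)
    for signature, examples in by_format.items():
        prompt_store[signature] = distill_format_instructions(examples)


def extract_invoice_fields(invoice_text: str) -> str:
    """Step 5: run extraction with a smaller model, using the format-specific
    prompt when one exists and a generic prompt for unseen formats."""
    signature = format_signature(invoice_text)
    instructions = prompt_store.get(signature, GENERIC_PROMPT) if signature else GENERIC_PROMPT
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": invoice_text},
        ],
    )
    return response.choices[0].message.content
```

The reasoning model's cost is paid once per format and amortized across every document sharing that signature, which is where the savings described in the footnote below come from.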

Performance

Using this strategy, we significantly improve performance while lowering the cost of inference by roughly 15x.*

Final Thoughts

We believe that the best software happens when you deeply understand both the technology and the business domain. We’re applying this to the insurance back office, which has gone largely untouched by modern technology, making it one of the most exciting areas in fintech to apply AI. If this sounds interesting, we’re hiring engineers! Check out open roles here.

*Based on relative token use and cost of OpenAI GPT-4o, GPT-4o mini, and o1 as of Jan 2025. Cost of o1 inference is amortized across all uses for documents with matching format signatures.
