Fluree Blog | Blog Post | Arnaud Cassaigne | 07.21.25

Understanding Intelligent Document Extraction with Fluree: Part 2 – Hybrid Models

This is Part 2 of a series exploring Fluree's hybrid document extraction approach, which combines efficient on-premise tools with selective cloud enhancement to balance accuracy, cost, and operational constraints for diverse document types. This article was co-authored by Arnaud Cassaigne and Aness Hmiri.

The first part of this series demonstrated how Fluree integrates cutting-edge LLMs with its secure knowledge graph platform to tackle complex document extraction challenges—from OCR to multimodal understanding. However, not every use case requires the raw power (or the cost) of a top-tier cloud model like Gemini or Claude. This second article explores lighter, hybrid approaches that combine efficient on-premise tools with selective cloud enhancement—while still benefiting from Fluree’s orchestration, validation, and governance pipeline.

This strategy addresses a crucial need for organizations operating under strict privacy, latency, or budget constraints. By orchestrating locally deployable tools such as Docling, Tesseract, and PyMuPDF—and optionally augmenting them with advanced LLMs like Gemini Flash—Fluree offers a scalable and cost-optimized alternative without sacrificing accuracy or traceability.

Document Extraction Tasks

As in Part 1, we evaluated multiple tools on four core document extraction tasks:

  • Structured Text Extraction from PDFs: Recovering heading hierarchies and paragraph flows from complex layouts.
  • Table Extraction: Parsing embedded or irregular tabular data into structured formats.
  • Handwritten OCR: Digitizing historical or cursive handwriting.
  • Text-in-Image Extraction: Recovering contextual text from embedded images (e.g., signage or diagrams).

Hybrid Models Used by Fluree

We tested five tools, each with distinct characteristics:

  • Docling: Local-first open-source tool combining RapidOCR with a layout analyzer. Ideal for structured documents.
  • Docling + Gemini 2.0 Flash: Hybrid mode where complex pages are selectively routed to Google’s Gemini Flash model.
  • Tesseract: Classic open-source OCR engine best suited for clean, printed text.
  • PyMuPDF: Lightweight PDF text extractor prioritizing speed over structure or OCR.
  • Unstructured.io: Library designed for broad-format document parsing, with limited open-source capabilities.

Comparative Analysis

Docling delivers excellent performance on structured text and table recognition—often approaching commercial-grade accuracy. It processes documents locally in ~0.9 seconds per page with zero API cost. However, it lacks handwriting or image-based OCR support.

Docling + Gemini Flash significantly extends Docling’s capabilities. In hybrid mode, Gemini Flash handles the “hard” pages—such as scans with cursive writing or image-based text. The result is near-perfect extraction across all content types, while keeping average cost under $0.0015 per page and latency around 1.6 seconds.
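
The selective routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Fluree's actual implementation: `PageFeatures`, `route_page`, and `plan_document` are hypothetical names, and the flags marking a page as containing handwriting or image-based text are assumed to be computed by an upstream step that is not shown.

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    """Hypothetical per-page signals, assumed to be computed upstream."""
    page_number: int
    has_handwriting: bool = False
    has_image_text: bool = False


def route_page(page: PageFeatures) -> str:
    """Return "cloud" for hard pages and "local" for everything else."""
    if page.has_handwriting or page.has_image_text:
        return "cloud"  # e.g. Gemini Flash in the hybrid mode
    return "local"      # e.g. Docling running on-premise


def plan_document(pages: list[PageFeatures]) -> dict[str, list[int]]:
    """Group page numbers by the backend that should process them."""
    plan: dict[str, list[int]] = {"local": [], "cloud": []}
    for page in pages:
        plan[route_page(page)].append(page.page_number)
    return plan


# Hypothetical four-page document: pages 2 and 4 are "hard".
pages = [
    PageFeatures(1),
    PageFeatures(2, has_handwriting=True),
    PageFeatures(3),
    PageFeatures(4, has_image_text=True),
]
print(plan_document(pages))
```

Keeping the routing decision in one small function means the criteria for a "hard" page can be tightened or relaxed without touching either extraction backend.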

Tesseract provides acceptable results for printed text but struggles with layout detection and fails entirely on handwriting. It remains useful for high-speed, low-fidelity tasks.

PyMuPDF is ultra-fast (~4ms/page) but offers raw text only—no structure, no tables, and no OCR. It’s best reserved for lightweight tasks like simple metadata indexing.

Unstructured.io offers a broad but shallow approach, underperforming across all tested dimensions and showing high latency (~4.2s/page).
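
One way to make this comparison actionable is to encode it as data and pick the fastest tool that covers a document's needs. The sketch below is an assumption-laden illustration: the capability labels are our reading of the comparison above, the latencies are the approximate per-page figures quoted in this article, and `ToolProfile` and `pick_tool` are hypothetical names. Tesseract has no latency figure here, so it is never selected, and Unstructured.io is left out because the article does not attribute a distinct strength to it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolProfile:
    name: str
    capabilities: frozenset
    seconds_per_page: Optional[float]  # None where the article gives no figure


ALL = frozenset({"text", "structure", "tables", "handwriting", "image_text"})

# Capability sets are an interpretation of the comparison, not vendor specs.
TOOLS = [
    ToolProfile("PyMuPDF", frozenset({"text"}), 0.004),
    ToolProfile("Tesseract", frozenset({"text"}), None),
    ToolProfile("Docling", frozenset({"text", "structure", "tables"}), 0.9),
    ToolProfile("Docling + Gemini Flash", ALL, 1.6),
]


def pick_tool(required: set) -> Optional[ToolProfile]:
    """Pick the fastest tool with a known latency that covers `required`."""
    candidates = [
        tool for tool in TOOLS
        if required <= tool.capabilities and tool.seconds_per_page is not None
    ]
    return min(candidates, key=lambda tool: tool.seconds_per_page, default=None)


for needs in ({"text"}, {"tables"}, {"handwriting", "tables"}):
    chosen = pick_tool(needs)
    print(sorted(needs), "->", chosen.name if chosen else "no match")
```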

Case Studies of Hybrid Extraction with Fluree

Just like in Part 1, we validated hybrid performance through real-world scenarios:

ASU Sustainability Impact Review (Docling + Gemini Flash)

We processed the “ASU Sustainability Impact Review” report, a richly formatted blend of narrative text, multi-column data tables, SDG icons, and event photos, through a hybrid pipeline in which Docling parsed all textual content and structure while Gemini interpreted every embedded image. The run was flawless: Docling preserved the logical hierarchy of headings, tables, and note links, and Gemini produced concise, context-aware alt-descriptions for more than 70 visuals. For instance, the vehicle-display photo is rendered as “The image shows several white vehicles parked in an outdoor setting with an Arizona State University (ASU) bus visible in the background. The bus features text promoting environmental awareness.” The result is a clean, analysis-ready dataset plus accurately captioned imagery that needs only minor styling tweaks before publication.

Scientific Article Extraction (Docling + Gemini Flash)

When we applied the same hybrid pipeline to a peer-reviewed article on flow-battery technology, Docling again captured the full scholarly structure (section headings, equations, references, and multi-panel figures), while Gemini Vision handled the visuals, producing precise alt-text and even transcribing embedded labels. For example, Figure 2 (“Example of How Flow Batteries Work for a Grid Application”) was summarized as “The image illustrates a redox flow battery system with electrolyte tanks, a separator, cathode, anode, pumps and a grid connection. Arrows indicate the flow of electrolytes during use and charging processes,” with Gemini also extracting on-diagram terms such as “Cathode,” “Separator,” and “Charge Pump.” Overall, both text and image layers were extracted flawlessly, leaving a clean, richly annotated dataset ready for analysis or publication.

Financial Table Extraction (Docling)

Docling handled the consolidated income statement table just as smoothly. It preserved the full row order, note references, year columns, and parenthetical negatives; correctly distinguished subtotals like “Gross margin” and “Operating profit”; and carried through the earnings-per-share lines without mistyping long share counts. The extracted table is structurally identical to the source and immediately usable for analysis or layout with only minimal styling tweaks.
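
Preserving parenthetical negatives matters because downstream analysis usually needs them as signed numbers. The helper below is a small sketch of that post-processing step. It is not part of Docling, the `parse_amount` name is hypothetical, and the sample cells are made-up values rather than figures from the income statement discussed above.

```python
from decimal import Decimal


def parse_amount(cell: str) -> Decimal:
    """Convert a cell such as "1,234.5" or "(1,234.5)" to a signed Decimal."""
    text = cell.strip()
    # Accounting convention: a value wrapped in parentheses is negative.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    value = Decimal(text.replace(",", "").strip())
    return -value if negative else value


# Made-up sample cells for illustration only.
for cell in ["5,678.90", "(1,234)", " (12.5) "]:
    print(repr(cell), "->", parse_amount(cell))
```

`Decimal` is used instead of `float` so that monetary values keep their exact decimal representation.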

Conclusion

The evaluations presented in both parts clearly demonstrate that no single extraction tool is optimal across all document types and use cases. Fluree supports a range of OCR and NLP models, including open-source tools like Docling and Tesseract as well as cloud-based proprietary models such as Gemini and Sonnet, each with distinct strengths and limitations.

For example, cloud-based models like Gemini excel at extracting complex structures, handwritten text, and image-embedded content, but come with trade-offs in latency, cost, and potential data confidentiality considerations. On the other hand, local tools like Docling provide fast, cost-effective, and private processing for the majority of standard documents, but their performance decreases with degraded images or handwriting.

These results highlight the necessity of adapting the extraction method to the document type. Fluree’s hybrid approach addresses this by enabling organizations to select and apply the most appropriate tool or combination of tools for each scenario. This allows the bulk of processing to occur locally—maintaining control and reducing expenses—while delegating more complex pages to advanced cloud models only when necessary, thus balancing accuracy, cost, and operational constraints.
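
The cost side of this trade-off is simple arithmetic: the average cost per page is the share of pages sent to the cloud multiplied by the cloud price, plus the share kept local multiplied by the local price. The sketch below uses a hypothetical function name and purely illustrative inputs (a 20% cloud share and a $0.005 cloud price per page); neither value is a measurement from our evaluation.

```python
def blended_cost_per_page(cloud_fraction: float,
                          cloud_cost_per_page: float,
                          local_cost_per_page: float = 0.0) -> float:
    """Average per-page cost when only a fraction of pages goes to the cloud."""
    if not 0.0 <= cloud_fraction <= 1.0:
        raise ValueError("cloud_fraction must be between 0 and 1")
    return (cloud_fraction * cloud_cost_per_page
            + (1.0 - cloud_fraction) * local_cost_per_page)


# Illustrative inputs only: 20% of pages routed to a $0.005/page cloud model.
cost = blended_cost_per_page(cloud_fraction=0.2, cloud_cost_per_page=0.005)
print(f"${cost:.4f} per page")
```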

Bringing Value to Unstructured Data with Fluree

Documents and unstructured data represent a vast and largely untapped source of information for organizations. Extracting meaningful insights from these sources requires more than just OCR or NLP — it demands an integrated platform that can organize, secure, and make this data queryable in flexible ways.

Fluree addresses this challenge by converting extracted data into rich, cryptographically secured knowledge graphs. These graphs enable organizations to model relationships, provenance, and context in ways that traditional databases cannot. Once ingested, the data becomes immediately accessible and explorable using tools like Fluree’s natural language query interface, Flurio, which allows non-technical users to query complex data sets intuitively.
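
To give a feel for what "extracted data as a graph with provenance" can look like, the sketch below shapes an extraction result as a JSON-LD document. It is an illustration under assumptions, not Fluree's schema or API: the `to_jsonld` function, the `ex:` vocabulary, and the sample identifiers are hypothetical, the `schema:` and `prov:` terms come from the public schema.org and W3C PROV vocabularies, and the step of writing the result to a ledger is not shown.

```python
import json


def to_jsonld(doc_id: str, title: str, extractor: str, sections: list[str]) -> dict:
    """Shape an extraction result as JSON-LD with basic provenance."""
    return {
        "@context": {
            "schema": "http://schema.org/",
            "prov": "http://www.w3.org/ns/prov#",
            "ex": "http://example.org/vocab#",  # hypothetical vocabulary
        },
        "@id": doc_id,
        "@type": "schema:DigitalDocument",
        "schema:name": title,
        # Provenance: which extraction activity produced this record.
        "prov:wasGeneratedBy": {
            "@type": "prov:Activity",
            "ex:extractor": extractor,
        },
        # Each extracted section becomes its own addressable node.
        "schema:hasPart": [
            {
                "@id": f"{doc_id}#section-{index}",
                "@type": "ex:Section",
                "schema:text": text,
            }
            for index, text in enumerate(sections, start=1)
        ],
    }


record = to_jsonld(
    "http://example.org/doc/1",  # hypothetical identifier
    "Sample Report",
    "Docling",
    ["Introduction text", "Results text"],
)
print(json.dumps(record, indent=2))
```

Because every section carries its own `@id`, later queries can point back to the exact part of the source document that a fact came from.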

The platform’s blockchain-inspired architecture ensures data integrity and auditability, crucial for compliance and trust in sensitive environments. By integrating document extraction pipelines directly with this graph-based backend, Fluree turns unstructured documents such as contracts, manuals, and reports into strategic assets.

This unified approach means organizations can seamlessly ingest diverse document types, orchestrate multiple extraction models, and store the results securely in a flexible data model designed for long-term value extraction. In other words, Fluree not only helps you extract data but also ensures that the data can be confidently governed, queried, and leveraged for business insights.

For more information, see Fluree and Content Auto-Tagging and Classification | Fluree.