Fluree Blog | Arnaud Cassaigne | 07.21.25

Understanding Intelligent Document Extraction with Fluree: Part 1 – LLMs 

Fluree leverages state-of-the-art LLMs within a secure knowledge-graph database for intelligent document extraction. This is Part 1 of an article that highlights Fluree's process, the proprietary and open-source models used for tasks like structured text, table, and handwritten-text extraction, and a comparative analysis of their performance, cost, and speed. This article was co-authored by Arnaud Cassaigne and Aness Hmiri.

The rapid growth of unstructured digital content—think sprawling PDFs packed with dense text, hidden tables, technical diagrams, or even faded handwritten notes—creates a pressing need for accurate, trusted extraction pipelines. Fluree’s value goes far beyond merely choosing the “right” optical‑character‑recognition (OCR) or large‑language‑model (LLM) engine. By embedding those engines inside a secure knowledge‑graph database that provides built‑in data lineage, fine‑grained governance, and cryptographic traceability, Fluree turns raw extractions into verifiable, reusable knowledge assets. 

In Fluree’s processing chain, proprietary and open‑source models are interchangeable tools—important, but only one layer in a larger value‑creation stack that also includes:

  • Semantic enrichment that links newly extracted facts to controlled vocabularies and ontologies.
  • Rule‑based validation that detects and corrects model hallucinations or formatting errors.
  • GraphRAG‑powered retrieval that lets downstream applications query the fresh knowledge with complete provenance. 

Because the platform orchestrates, scores, and even cross-checks the outputs of multiple engines, organizations get the best of each model while retaining full control over cost, privacy, and accuracy.
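To make the cross-checking idea concrete, here is a minimal sketch (not Fluree's internal code) of running two extraction engines over the same page and comparing their outputs before the result is trusted. The engine functions are placeholders standing in for real OCR or LLM calls, and the agreement threshold is illustrative.

```python
# Minimal sketch: run multiple engines on one page, score their agreement,
# and flag the page for review when they disagree. Engines are placeholders.
from difflib import SequenceMatcher
from typing import Callable

def engine_a(page_bytes: bytes) -> str:
    """Placeholder for a proprietary multimodal LLM call."""
    return "Consolidated Balance Sheet - Assets: 1,204"

def engine_b(page_bytes: bytes) -> str:
    """Placeholder for an open-source OCR model call."""
    return "Consolidated Balance Sheet - Assets: 1,204"

def cross_check(page: bytes, engines: list[Callable[[bytes], str]],
                threshold: float = 0.9) -> tuple[str, float]:
    """Run every engine, compute the lowest pairwise agreement ratio,
    and return one output plus that score if agreement is acceptable."""
    outputs = [engine(page) for engine in engines]
    worst = 1.0
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            ratio = SequenceMatcher(None, outputs[i], outputs[j]).ratio()
            worst = min(worst, ratio)
    if worst < threshold:
        raise ValueError(f"Engines disagree (agreement={worst:.2f}); flag for review")
    return outputs[0], worst

text, score = cross_check(b"<pdf page bytes>", [engine_a, engine_b])
```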

Document Extraction Tasks

Fluree leverages both proprietary and open-source language models to efficiently extract and structure critical information from complex documents. Below are the core extraction tasks powered by these models:

  • Extracting Structured Text from PDFs:
    Fluree identifies and maintains the original text hierarchy in PDFs, such as headings, subheadings, and paragraphs. This enables clear and coherent reuse of extracted information in various digital formats and applications.
  • Table Extraction from PDFs:
    Even complex tables embedded deep within extensive text are accurately recognized and converted by Fluree into structured and reusable formats. This ensures reliable integration into analytical tools or databases, enhancing immediate usability.
  • Handwritten Text Extraction from Scanned Documents:
    Fluree effectively digitizes handwritten notes and archival documents, utilizing sophisticated optical character recognition (OCR) to restore handwritten content as digital text. This facilitates clearer archival management and easier access to historically valuable information.
  • Text Extraction from Embedded Images within PDFs:
    Fluree extracts text from images within PDFs, capturing essential explanatory details and metadata. This process significantly improves the comprehensibility and immediate usability of visual content, such as scientific diagrams or technical illustrations, for in-depth analysis and informed decision-making.
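As an illustration of how these tasks can be framed for a multimodal model, the sketch below asks a model to return headings, paragraphs, and tables as JSON and then validates the response. The call_model helper and the prompt wording are assumptions for illustration only, not Fluree's actual pipeline or any specific provider's API.

```python
# Minimal sketch: request structured extraction from a multimodal model
# and validate the JSON it returns. call_model() is a stand-in for a real
# provider call (GPT-4o, Gemini 2.5 Pro, etc.).
import json

EXTRACTION_PROMPT = """Extract the content of this PDF page.
Return JSON with:
  "headings":   list of {"level": int, "text": str}
  "paragraphs": list of str
  "tables":     list of {"caption": str, "rows": list of list of str}
Preserve the original reading order and hierarchy."""

def call_model(prompt: str, page_image: bytes) -> str:
    """Placeholder for a real multimodal API call."""
    return '{"headings": [], "paragraphs": [], "tables": []}'

def extract_page(page_image: bytes) -> dict:
    raw = call_model(EXTRACTION_PROMPT, page_image)
    data = json.loads(raw)           # basic validation: must be well-formed JSON
    for key in ("headings", "paragraphs", "tables"):
        data.setdefault(key, [])     # tolerate partially filled responses
    return data

page = extract_page(b"<page image bytes>")
```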

Large Language Models Used by Fluree 

Proprietary Models

  • GPT-4o (OpenAI)
    Launched in May 2024, GPT-4o is a multimodal model capable of processing text, images, and audio in real-time. It supports a context window of 128,000 tokens, enabling natural and efficient document processing.
  • Claude Sonnet 4 (Anthropic)
    Introduced in May 2025, Claude Sonnet 4 specializes in reasoning and planning, providing rapid responses and enhanced conversational memory for complex tasks.
  • Gemini 2.5 Pro (Google DeepMind)
    Released in March 2025, Gemini 2.5 Pro offers extensive capabilities in complex reasoning, coding tasks, and multimodal comprehension with a context window of one million tokens.

Open-Source Models

  • Mistral OCR (Mistral AI)
    Launched in March 2025, Mistral OCR is optimized specifically for optical character recognition tasks, reliably interpreting document elements like text, tables, and equations.
  • Gemma 3 (27B IT) (Google)
    Available since March 2025, Gemma 3 is a versatile model featuring 27 billion parameters, providing multimodal capabilities and supporting contexts up to 128,000 tokens.
  • Llama 4 (17B x128E) (Meta)
    Announced in April 2025, Llama 4 Maverick employs a “mixture-of-experts” approach with 128 experts, totaling 400 billion parameters, suited for general-purpose extraction tasks and conversational support.

Comparative Analysis

The comparative analysis highlights distinct strengths among evaluated models:

  • Gemini 2.5 Pro and Claude Sonnet 4 excel across structure and table extraction tasks, providing highly accurate and reliable outputs. Gemini notably achieves perfect results, positioning it as ideal for demanding applications requiring minimal errors.
  • GPT-4o and Mistral OCR exhibit limitations, notably refusing or struggling with extraction tasks, particularly in structure and table extraction. Despite this, GPT-4o offers quicker execution times, potentially beneficial in scenarios prioritizing speed over precision.
  • Open-source models Gemma 3 and Llama 4 demonstrate significant speed and low operational costs but perform poorly in handwritten extraction and OCR tasks. Their affordability and speed make them suitable for simpler, high-volume tasks where accuracy is less critical.

These insights allow organizations to choose models that align closely with their specific needs regarding accuracy, speed, and budget.
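One way to act on these findings is a simple routing rule. The sketch below mirrors the qualitative comparison above: accuracy-critical and handwritten work goes to a top-tier model, speed-sensitive work to GPT-4o, and cheap high-volume work to an open-source model. The model names and routing logic are illustrative assumptions, not part of Fluree's product.

```python
# Minimal sketch of task-based model routing derived from the comparison
# in prose; the rules and defaults here are illustrative only.

def pick_model(task: str, priority: str) -> str:
    """task: 'structure' | 'table' | 'handwriting' | 'image_text'
    priority: 'accuracy' | 'speed' | 'cost'"""
    if priority == "accuracy" or task == "handwriting":
        return "gemini-2.5-pro"      # or claude-sonnet-4 for accuracy-critical work
    if priority == "speed":
        return "gpt-4o"              # faster, at some cost to precision
    return "gemma-3-27b-it"          # low-cost default for simple, high-volume tasks

assert pick_model("table", "accuracy") == "gemini-2.5-pro"
assert pick_model("structure", "cost") == "gemma-3-27b-it"
```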

In the next article, we will explore alternative, lighter extraction models like Tesseract and Docling, suitable for scenarios demanding fewer computational resources.

Case Studies of Document Extraction with Gemini 2.5 Pro

  • Handwritten Text Extraction (Example: Historical Birth Record)
    Gemini successfully transcribed a complex handwritten birth record written in dense, cursive French from the 19th century. The extracted text maintained accuracy and readability, preserving the semantic context such as signatures and official roles (e.g., mayor, witnesses), showcasing Gemini’s advanced ability to interpret intricate handwriting.

  • Technical Diagram Interpretation (Example: Lithium-Ion Battery Schematic)
    Gemini effectively analyzed and structured data from an annotated lithium-ion battery schematic. It provided structured metadata, including titles, clear scientific descriptions, and correctly identified component terms (e.g., “Cathode,” “Graphite,” “Separator”). Gemini’s interpretation capabilities extend beyond OCR, offering meaningful insight into the diagram’s function.

  • Image Text Recovery via Contextual Reasoning (Example: ASU Shuttle Bus)
    In a challenging scenario where text on an ASU shuttle bus was partially illegible due to glare and resolution, Gemini successfully inferred the complete text by contextualizing surrounding visual and textual clues. This illustrates Gemini’s robust semantic reasoning capabilities, surpassing the limitations of conventional OCR.

  • Archival Table Digitization (Example: 1894 Birth Registry)
    Gemini digitized a historical handwritten birth registry, accurately converting varied handwriting styles and data formatting into a structured digital table. It maintained data integrity, including correct ordering, spellings, and numerical details, confirming Gemini’s proficiency in handling legacy documents.

  • Financial Document Parsing (Example: Consolidated Balance Sheet)
    Gemini reliably parsed a complex balance sheet, accurately segmenting it into structured sections (Assets, Liabilities, Equity) and preserving hierarchical relationships and multilingual headers. The digitized output closely matched the original document’s layout, demonstrating precision in managing structured financial records.
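To show how such an extraction might then flow into a knowledge graph, here is a minimal sketch of a parsed balance sheet expressed as JSON-LD with its extraction provenance attached. The ex: vocabulary and the figures are placeholders for illustration, not Fluree's schema or the document's actual numbers.

```python
# Minimal sketch: shape an extracted balance sheet as JSON-LD, preserving
# section hierarchy and recording which model produced it. Vocabulary and
# values are placeholders.
import json

balance_sheet = {
    "@context": {"ex": "http://example.org/finance#"},
    "@id": "ex:balance-sheet-2024",
    "@type": "ex:ConsolidatedBalanceSheet",
    "ex:extractedBy": "gemini-2.5-pro",          # provenance of the extraction
    "ex:section": [
        {"@type": "ex:Assets",      "ex:total": 1204.0},   # placeholder figures
        {"@type": "ex:Liabilities", "ex:total": 830.5},
        {"@type": "ex:Equity",      "ex:total": 373.5},
    ],
}

print(json.dumps(balance_sheet, indent=2))
```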

Conclusion and Outlook – Part 1

This first article demonstrated how Fluree pairs state‑of‑the‑art LLMs with its secure knowledge‑graph platform to transform intractable PDFs into trusted data. Fluree’s architecture enables orchestration, cross‑comparison, and post‑processing of multiple models—extracting the best of each while ensuring traceable, reusable outputs.

Our comparative analysis revealed that today’s top-tier LLMs—particularly Gemini 2.5 Pro and Claude Sonnet 4—can achieve near‑perfect extraction quality, even on handwritten and highly structured content. However, this comes at a significant computational and financial cost. Open-source models like Llama 4 and Gemma 3, while more affordable and faster, offer limited performance in high-accuracy scenarios such as OCR and layout extraction.

In Part 2, we’ll explore cost‑efficient alternatives—including Docling and Tesseract—and show how Fluree’s knowledge-graph pipeline enables hybrid approaches that trade a small drop in accuracy for dramatically lower cost and latency.