Thought Leadership · Kevin Doubleday · 09.19.23

Making Data FAIR(EST)

Extending the FAIR principles to enable trusted collaboration and instant composability in data.

In 2020, we published a blog post on the FAIR principles for data management.

As a quick recap, the FAIR principles of data management emphasize the Findability, Accessibility, Interoperability, and Reusability of data. These principles were outlined in a 2016 article published in Scientific Data as a way to curb silos and promote broader collaboration within the academic community.


What does FAIR have to do with enterprise data? 

Like scholarly data, enterprise information today is lost in silos, rarely re-used, integrated, or leveraged in a meaningful way beyond its original purpose. 

We call this forever-lost information “dark data.” According to CIO magazine, 40% to 90% of enterprise data is considered “dark”, depending on the industry.

Of the information that is potentially available for reuse through extraction methods such as ETL, APIs, or data clouds, quality is often so far below par that the data is incomprehensible without expensive data engineering to normalize it. We call this information “dirty data.” Experian recently reported that, on average, U.S. organizations believe 32 percent of their data is inaccurate. The correlated impact is equally staggering: in the U.S. alone, bad data costs businesses $3 trillion per year.

Dirty and lost data are commonplace at organizations of all sizes, resulting in lost time and money. 

Behind these problems are the broken promises of “Big Data”: the illusion, sold and perpetuated, that an abundance of data and some fancy analytics tools could unlock a new era of knowledge discovery and decision-making. In practice, blindly implementing big data solutions required substantial investments in technology, infrastructure, and personnel, and the time and effort required to integrate disparate data sources and ensure data quality often outweighed the potential gains. In many cases, the costs associated with big data initiatives exceeded the benefits, leaving organizations disillusioned. 

This is why Gartner recently predicted that 70% of organizations will shift their focus from Big Data to “Small and Wide” data, emphasizing the importance of high-quality, linked information over high quantities of low-quality data. Brian Platz, CEO of Fluree, covered this idea in 2022 with a Forbes opinion piece entitled How Small Data Is Fulfilling The Big Data Promise.

What does this have to do with FAIR? The FAIR principles provide an agnostic but prescriptive framework for making data high quality, accessible, and reusable so that all stakeholders along the data value chain can glean insights with minimal friction. Applying the FAIR principles as a governance framework can help organizations reduce the risk of dark or dirty data. 

Adding “EST” 

Today, we are making the case to extend the FAIR principles to include notions of extensibility, security, and trust. While FAIR provides an excellent framework for open data reuse, these three additions contribute guidance for organizations looking to treat data as a strategic, collaborative asset across boundaries. 

Data-centricity is the ethos driving these additional principles. In a data-centric architecture, many of the capabilities required to share information across applications, business units, or organizational boundaries are pushed from the application layer (software) down to the data layer (database). Organizations moving to a data-centric architecture strip away the layers of middleware and software built to handle specific tasks related to interoperability, trust, security, and sharing, and instead adopt data-centric, model-driven architectures that enable composability and collaboration out of the box. 

Let’s dive into the proposed appendix to FAIR(EST): 

E – Extensibility

Data extensibility involves the capacity to expand a data model dynamically for additional capabilities while preserving its original structure and meaning. 

In a data-centric architecture, data is the central product, while “agents” such as applications, data science workflows, or machine learning systems interact with and revolve around this core of interconnected data. This requires the data to be useful in a variety of contexts, and, importantly, freed from proprietary formatting. 

By leveraging standardized vocabularies, in the form of semantic ontologies, data producers and consumers can extend data’s potential value and applicability across operational or analytical contexts. Relational systems are rigid and often proprietary; the opposite is true for semantic graph databases, which are flexible, extensible, and built on open standards. 

Extensibility through open semantics standards allows data models to grow and adapt as new analytical, operational, or regulatory data requirements emerge. This saves time and resources, as organizations can extend data models as needed instead of creating entirely new ones or investing in a mess of ETLs and disconnected data lakes. 
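As a minimal sketch of what this looks like in practice (the entities, predicates, and helper below are purely illustrative, not a Fluree API), consider a triple-based model where a new regulatory attribute is asserted without any schema migration:

```python
# A sketch of schema-flexible ("open world") data: facts are just triples,
# so adding a new predicate requires no table migration.
triples = {
    ("customer:1", "name", "Ada"),
    ("customer:2", "name", "Grace"),
}

# A new regulatory requirement arrives: record a consent flag.
# We simply assert new facts; existing data and queries are untouched.
triples.add(("customer:1", "consentGiven", "true"))

def values(subject, predicate):
    """All objects asserted for a subject/predicate pair."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(values("customer:1", "consentGiven"))  # {'true'}
print(values("customer:2", "consentGiven"))  # set() - simply not yet asserted
```

The second query returning an empty set, rather than an error, is the extensibility point: the model grew for one record without forcing anything on the rest.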

S – Security 

While FAIR alone provides an excellent framework for interoperability and usability in spaces where data is an open resource, enterprise ecosystems often operate within closed or hybrid boundaries where privacy, compliance, and competitive advantage are key factors. In 2020, we presented on data-centric security at the DCAF (Data-Centric Architecture Forum), making the case for functions related to identity and access management to be accomplished at the data layer, as data.  In a data-centric security framework, security policies and protocols are defined and enforced at the data layer, rather than deferred to a server, application, or network. 

We call this “data defending itself.” Security is baked in, and thus inseparable from the data it protects. Using these powerful embedded operations, we can turn business logic into enforced policies that travel alongside the data, forever. 

Enabling data-centric security opens data up to become more immediately and freely available. Within this framework, we can open our data sets to be queried directly by virtually anyone, without moving the data or building bespoke APIs that abstract certain elements; results are filtered according to the established rules associated with the user’s identity. 
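To make the idea concrete, here is a hypothetical sketch of policy stored as data and enforced at query time. The roles, fields, and `query` helper are illustrative only, not Fluree’s actual policy engine:

```python
# Security policy expressed *as data*, enforced at the data layer
# rather than in application code.
records = [
    {"id": 1, "dept": "hr",  "salary": 90000},
    {"id": 2, "dept": "eng", "salary": 120000},
]

# Which roles may see which fields, stored alongside the data itself.
policy = {
    "analyst":  {"id", "dept"},           # no salary visibility
    "hr_admin": {"id", "dept", "salary"},
}

def query(role):
    """Return records filtered down to the fields the role's policy allows."""
    allowed = policy.get(role, set())
    return [{k: v for k, v in r.items() if k in allowed} for r in records]

print(query("analyst"))  # salary is filtered by policy, not hidden by an API
```

Because the policy travels with the data, any consumer can query directly and the filtering happens the same way every time, with no per-application enforcement code.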

Read more on data-centric security here

T – Trust

We make the case in our data-centric series that in order for data to be effectively shared across trust boundaries, it must have digital authenticity inherently built in. If we are to share data dynamically across departments, partners, and other stakeholders, that information should come pre-packaged with proof of where it came from and that it did in fact originate from a trusted source. 

A few elements make up what we call “trust” when it comes to information: 

  • Data Provenance: We can prove who (or what) originated data, ensuring it came from an authoritative source. 
  • Data Lineage & Traceability: We have comprehensive visibility into the complete path of changes to data: who has accessed or updated a piece of data in a system, when, and under what circumstances.
  • Data Integrity: We can detect data tampering at any level of granularity (down to the pixel of an image or a single letter in a text document).
  • Identity Management: We can control and prove the identity (user or machine) associated with any of the above events (data origination, changes, access).
  • Proof: We (humans or machines) can prove the above criteria using standard math and cryptography, taking “Triple A” (Authentication, Authorization, and Audit) to the next level.
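As one small, concrete illustration of the integrity and proof bullets, a cryptographic content hash detects tampering at any granularity. A real system would also sign the hash to bind it to an identity; this sketch omits that step:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash: any change, down to a single character, changes it."""
    return hashlib.sha256(data).hexdigest()

original = b"Invoice #1042: amount due $500"
tampered = b"Invoice #1042: amount due $900"  # a single character changed

# The fingerprints diverge completely, so tampering is detectable
# by anyone holding the original hash.
assert fingerprint(original) != fingerprint(tampered)
```

Storing the fingerprint alongside the data (and signing it) is what lets provenance and integrity be verified by math rather than by trusting a server.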

As more and more data consumers are unleashed onto enterprise data (*clears throat* AI), ensuring the digital authenticity of information becomes critical. This is a data quality, risk, and operational issue that cannot wait until tomorrow. 

Read more about data-centric trust here


FAIR(EST) can be taken as a high-level framework to move your organization away from legacy data infrastructure. Modeling FAIR data future-proofs that information for re-use. Adding extensibility, security, and trust to your data enables true composability across boundaries without expensive data engineering or data security issues. 

The result might closely align with what our industry calls “data mesh”, a lightweight infrastructure for decentralized sources to collaborate on data across technological and organizational boundaries. However you might define it, the path to FAIR(EST) is a technological and cultural journey worth hiking.

We will cover how organizations can take their first (second and third) steps to achieve data-centricity in our upcoming webinar with Dave McComb from Semantic Arts. Check it out here: Data-Centric Strategies to Break the Cycle of Legacy Addiction

What other additions would you make to the FAIR principles? Email us at [email protected].    

Knowledge graphs are powerful frameworks for organizing, linking, and sharing data with universal meaning. There is an ongoing debate about which graph data model is best, and in this blog post, we’ll explore why RDF (Resource Description Framework) stands out over LPGs (Labeled Property Graphs) as the superior choice for building sustainable and scalable knowledge graphs. Let’s dive into the reasons that make RDF shine as the backbone of knowledge graphs.

RDF Versus LPG: A comparison

In RDF, data is represented as triples consisting of a subject, a predicate, and an object. These are known as RDF statements, e.g. “Jill is a friend of Jack.” Triples form a directed graph, with the subject and object as nodes and the predicate as a labeled edge. An LPG likewise uses nodes to represent data elements and labeled edges to connect them. Each edge represents a relationship between nodes, and both nodes and edges can have properties associated with them.

In the RDF world, such a graph would consist of the following set of triple statements:

Subject | Predicate | Object
Jill | Likes | Artwork
Jill | Is a Friend of | Jack
Jack | Is a Friend of | Kevin
Kevin | Likes | Artwork
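To make the triple model concrete, here is a toy in-memory triple store (plain Python rather than a real RDF database) that answers a pattern query analogous to SPARQL’s `SELECT ?s WHERE { ?s :likes :Artwork }`:

```python
# The statements above as an in-memory list of triples.
triples = [
    ("Jill",  "likes",      "Artwork"),
    ("Jill",  "isFriendOf", "Jack"),
    ("Jack",  "isFriendOf", "Kevin"),
    ("Kevin", "likes",      "Artwork"),
]

def match(s=None, p=None, o=None):
    """Triple-pattern match: a None argument acts like a SPARQL variable."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# "Who likes Artwork?"
print([s for s, _, _ in match(p="likes", o="Artwork")])  # ['Jill', 'Kevin']
```

Every query, however complex, decomposes into these pattern matches joined together, which is what makes the triple model so uniform to query.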

There are some key differences that make RDF a better choice for enterprise knowledge graphs. Let’s explore:

Why do these differences matter?  

While LPGs can certainly be useful for graph analytics use cases, they lack key features to make data accessible and useful when it comes to the need for interoperability and scale. Specifically, LPGs lack robust support for ontologies, schema standardization, and semantic standards – all characteristics of a sustainable knowledge graph initiative. Let’s explore:

When LPG Makes Sense

It’s worth exploring the distinctive advantages of labeled property graphs (LPGs) within specific contexts.

Properties of Properties:

In the world of LPGs, edges can bear their own properties, allowing for the inclusion of valuable metadata such as the source of information, date of assertion, and confidence level. This can be emulated with statement objects in standard RDF, but unfortunately at a performance cost. A better approach is RDF-* and SPARQL-*, which are nearing standardization and are already supported by several RDF databases.
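For illustration, here is that standard-RDF emulation in miniature: classic reification represents the edge as its own statement resource, which is exactly the extra-triple overhead noted above. (Prefixes here are plain illustrative strings, not resolved namespaces.)

```python
stmt = "stmt:1"  # a resource standing for the edge "Jill isFriendOf Jack"
triples = [
    (stmt, "rdf:type",      "rdf:Statement"),
    (stmt, "rdf:subject",   "Jill"),
    (stmt, "rdf:predicate", "isFriendOf"),
    (stmt, "rdf:object",    "Jack"),
    # Metadata about the edge itself - the "property of a property":
    (stmt, "source",     "crm-import"),
    (stmt, "confidence", "0.9"),
]

# Reading the annotation back requires an extra hop through the statement node.
meta = {p: o for s, p, o in triples if s == stmt}
assert meta["confidence"] == "0.9"
```

With RDF-*, the same metadata attaches directly to a quoted triple (e.g. `<< :Jill :isFriendOf :Jack >> :confidence 0.9`), avoiding the four bookkeeping triples per annotated edge.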

Property Path Discovery:

LPGs have a native ability to traverse the shortest path between two nodes, which is valuable for unraveling complex relationships and uncovering hidden insights. This holds particular significance for domains where relationships are multi-faceted and require in-depth exploration, such as social networks, but is of limited importance elsewhere. There is nothing special about property graphs here, however: path discovery is a function of the query language. SPARQL doesn’t include a shortest-path operation, but any RDF database (including Fluree) can implement one if the application demands it.
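Since path discovery is a query-engine feature rather than a data-model feature, it can be sketched as a plain breadth-first search over an edge list, whichever graph model happens to store the edges (names here are illustrative):

```python
from collections import deque

edges = [("Jill", "Jack"), ("Jack", "Kevin"), ("Kevin", "Ada")]

def shortest_path(start, goal):
    """Breadth-first search: the first path to reach the goal is shortest."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)  # treat edges as undirected
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path exists

print(shortest_path("Jill", "Ada"))  # ['Jill', 'Jack', 'Kevin', 'Ada']
```

A database implementation would push this traversal into the engine for efficiency, but the algorithm itself is model-agnostic.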

It’s important to note that LPGs shine brightest in these limited scenarios where edge properties and path discovery are paramount. The beauty of the RDF ecosystem lies in its versatility and adaptability. The introduction of RDF-* and SPARQL-* speaks to the growing recognition of these specialized features within the RDF paradigm.


When it comes to data management, choosing the right tool for the job is best practice. If your graph use case is heavily analytical within a closed environment and demands “properties of properties,” LPGs may play a role. But if your knowledge graph demands interoperability, integration, and sharing across boundaries, RDF is the clear format of choice to future-proof your data across its value chain.

The most valuable knowledge graphs are flexible and dynamic, adapting to new data types, consumer patterns, and business requirements. To meet these demands, Knowledge Graphs must leverage the benefits of semantic standards and “open world” assumptions that RDF provides.

If you’re looking for a native RDF graph database, Fluree is an excellent choice for knowledge graph initiatives. With Fluree, your knowledge graph can expand across domains, power any number of applications, and extend both “read” and “write” capabilities to permissioned users. 

Most knowledge graphs are read-only, with limited write capabilities. In this article, we explore making knowledge graphs dynamically readable and writable to power front-, middle-, and back-office operations.

The most valuable knowledge graphs are flexible and dynamic, adapting to new data types, consumer requirements, and business patterns. However, most knowledge graphs are simply seen as analytical layers above the true source data, and serve a limited scope of analytical cases. In this piece, we will explore how organizations can use knowledge graphs as their system for operational applications in addition to standard analytics. Knowledge Graphs must extend their value from analytical tools to being able to power operational applications across a broad suite of business domains.


In the era of information overload, the need for effective data management and utilization has become increasingly important. Knowledge graphs have emerged as a powerful tool for organizing and connecting vast amounts of data, enabling valuable insights and supporting decision-making processes. While knowledge graphs have predominantly been used for analytical purposes, there is immense potential in making them operational, thereby transforming insights into action. This blog post explores the concept of operational knowledge graphs, their benefits, and strategies for making them a practical reality.

Understanding Knowledge Graphs

A knowledge graph is a structured representation of knowledge that captures relationships between entities, attributes, and concepts. It consists of interconnected nodes (entities) and edges (relationships) that provide context and meaning to the data. Traditionally, knowledge graphs have been utilized for analytical tasks, such as data exploration, semantic search, and recommendation systems. However, their potential extends far beyond analysis, enabling organizations to operationalize their knowledge for enhanced decision-making and automation.

Benefits of Operational Knowledge Graphs

Knowledge Graphs shouldn’t just act as a disconnected layer away from your operational data – they should dynamically reflect, inform and drive your business. What if you could make this a reality?

Contextualized Decision-making: By connecting diverse data sources and organizing them into a knowledge graph, organizations can gain a holistic view of their data landscape. This allows decision-makers to access real-time, updated, and contextualized information, leading to more informed and confident decisions.

Efficient Data Integration: Operational knowledge graphs provide a framework for integrating data from multiple sources, such as databases, APIs, and external repositories. This seamless integration enables a unified data model that can be easily accessed and utilized across various applications and systems.

Real-time Insights: With operational knowledge graphs, organizations can leverage real-time data to derive actionable insights. By continuously updating the graph with the latest information, businesses can stay ahead of the curve and respond promptly to changing market conditions.

Automation and Intelligent Systems: Operational knowledge graphs form the foundation for building intelligent systems and automating complex processes. By encoding domain-specific knowledge and rules, organizations can create intelligent workflows, chatbots, and recommendation engines that can make autonomous decisions based on the knowledge graph’s insights.

Making it a reality

At Fluree, we always recommend a four-step cyclical process: model, map, connect, and expand.

Model: Start with a domain and model it — this could be a business application schema, a business or industry ontology, or an existing standardized schema. Develop the conceptual model to represent entities, attributes, and relationships within the knowledge graph, using standard semantic web technologies such as RDF (Resource Description Framework) to define the schema and capture domain-specific knowledge.

Map: Identify relevant data sources and design a strategy to map them to your model. This may involve data extraction, transformation, and loading processes to ensure data quality and consistency, as well as entity resolution that ensures data is correctly linked and represented within the knowledge graph. 
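As a hedged sketch of the “map” step, the column-to-predicate mapping is the strategy; everything here (prefixes, column names, identifiers) is hypothetical:

```python
import csv
import io

# A legacy source flattened to CSV, and a hand-authored mapping from
# its columns to predicates in the target model.
legacy_csv = "id,full_name,dept\n17,Ada Lovelace,eng\n"
COLUMN_MAP = {"full_name": "ex:name", "dept": "ex:department"}

triples = []
for row in csv.DictReader(io.StringIO(legacy_csv)):
    subject = f"ex:person/{row['id']}"  # mint a stable identifier per row
    for col, predicate in COLUMN_MAP.items():
        triples.append((subject, predicate, row[col]))

print(triples)
# [('ex:person/17', 'ex:name', 'Ada Lovelace'), ('ex:person/17', 'ex:department', 'eng')]
```

Real mapping pipelines add entity resolution on top of this, so that the same person arriving from two sources resolves to one subject identifier rather than two.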

Connect: Integrate the operational knowledge graph into various applications, systems, and decision-making processes. Once data is modeled, mapped, and transformed into an operational knowledge graph database, you can connect any kind of consumption pattern. This can involve building data-driven applications, embedding the graph into existing systems, or creating APIs that expose the graph’s insights to other applications.

This may also include implementing efficient query and search mechanisms that allow users to explore and retrieve information from the knowledge graph. 

With Fluree, you can provide policy-driven behavior to extend governed read or write access to every user, system, or domain.

Expand: Extend your data model to represent new business domains, continuously update your knowledge graph with new data from existing or new sources, and expand knowledge graph capabilities to empower new users and satisfy new business needs.

Get Started with Fluree

Wherever you are in your Knowledge Graph journey, Fluree can help you accelerate and expand your knowledge graph across domains, power any number of applications, and extend both “read” and “write” capabilities to permissioned users. Learn more here.


Knowledge graphs are as complex as they are game-changing. As your organization evolves to become more data-centric, taking steps to bring knowledge graphs closer to operational data will provide a foundation for success. It’s important to start small with one domain, prove value, and iteratively build upon early success.

We’ve all heard that “data is the new oil” or a similar analogy to describe the potential business value of enterprise information. But can that oil be found? According to a new Forrester report, the answer is likely not, seeing as employees lose 12 hours a week chasing data on average. 

Can that data be leveraged? Forrester has the grim answer again: between 60 percent and 73 percent of all data within an enterprise goes unused for analytics. 

While we’ve made impressive strides in IT to accomplish tasks at scale and speed (storage, compute, AI), we seem to have treated data as a by-product of these functions, without accounting for the need to re-use or share that data beyond its originating source system. 

More specifically, we’ve treated data as a siloed by-product of the average 367 software apps and systems large organizations use to manage their workflows, none of which “speak the same language.” 

As a result, we are left with sprawling, disconnected heterogeneous data sources that are potentially duplicated, out-of-date, inaccessible, and most likely never used. 

It’s no surprise that “democratizing data across organizations” is on the top of most Chief Data Officers’ priority list, but how do we accomplish this at scale and without adding yet-another-data silo in the form of a fancy new data lake or warehouse? 

Chief Data Officers and other professionals in the enterprise data management space are turning to knowledge graphs as the desired tool to connect disparate heterogeneous data assets across organizational disciplines. Gartner predicts that by 2025, graph technologies will be used in 80% of data and analytics innovations, up from 10% in 2021, facilitating rapid decision making across the organization.

What are knowledge graphs?

A knowledge graph is a database that represents knowledge as a network of entities and the relationships between them.


Knowledge Graphs offer a powerful way to organize and connect data across an organization through the use of semantic standards and universal ontologies (read more on the Fluree Blog: Semantic Interoperability – exchanging data with meaning). Knowledge graph use cases are growing rapidly, as the need to connect and integrate disparate data sources grows everyday for organizations looking to play effective “data offense” and “data defense” simultaneously. 

However, despite the many benefits of knowledge graphs, most enterprises are not yet ready for such an initiative, specifically due to poor, duplicated, and non-interoperable data. Those 300+ SaaS application silos contribute to poor data quality, lack of interoperability, and lack of data governance.

Let’s dive into these challenges: 

Challenge #1: Lack of Data Quality

One of the biggest challenges facing organizations looking to implement a knowledge graph is the quality of their existing data. In many cases, enterprise data is duplicated, incomplete, or simply not fit for purpose. This can lead to a range of problems, from difficulty in extracting insights from the data to confusion and errors when trying to make sense of it all. Poor data quality can be a major roadblock for any knowledge graph initiative, as it can make it difficult to build an accurate and comprehensive understanding of an organization’s data.

Challenge #2: Lack of Data Interoperability

Another challenge for knowledge graph implementation is the issue of interoperability. Most organizations have data stored in various formats and systems, making it difficult to connect the dots and derive meaningful insights from the data. In addition, many enterprises rely on proprietary software and data formats, which can make it even harder to integrate disparate data sources into a single knowledge graph. Without a standard way to connect all of their data sources, organizations are unable to build a comprehensive knowledge graph that reflects the true complexity of their business.

Challenge #3: Lack of Data Governance

Lastly, many enterprises struggle with data governance and management, which can be a significant barrier to knowledge graph implementation. Data governance encompasses a wide range of practices and policies that are designed to ensure that data is managed effectively, from the way it is stored and secured to the way it is used and shared. Without robust data governance, organizations may be unable to ensure that their data is of sufficient quality and consistency to support a knowledge graph initiative. This can lead to a lack of trust in the data and make it difficult to build meaningful insights from it.

While knowledge graphs offer a powerful way to unlock the potential of enterprise data, most organizations are not yet ready for such an initiative. The challenges of poor, duplicated, and non-interoperable data, as well as data governance and management, pose significant barriers to implementation. 

And – given the state of enterprise data management – the average data source is not quite ready for inclusion in a knowledge graph.

The Path Forward

Best Practices in Prepping Enterprise Data for Knowledge Graphs

Knowledge graphs provide a powerful way to capture, organize, and analyze information from various sources, enabling organizations to gain insights that were previously hidden or difficult to access. However, preparing legacy data for an enterprise knowledge graph can be a complex and challenging process. Let’s dive into the common steps needed to build an effective enterprise knowledge graph: 

1 – Define the scope and goals of the knowledge graph project: The first step in preparing legacy data for an enterprise knowledge graph is to clearly define the scope and goals of the project. This involves identifying the data sources that will be included in the knowledge graph, the types of entities and relationships that will be represented, and the business use cases that the knowledge graph will support.

2 – Cleanse and standardize data: Before data can be added to a knowledge graph, it must be cleansed and standardized to ensure accuracy and consistency. This involves identifying and correcting errors, removing duplicate entries, and standardizing formats and values across different data sources. 
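A small sketch of what step 2 can look like in code, using illustrative records: values are normalized to canonical formats (lowercased emails, ISO 8601 dates) so that duplicates become detectable:

```python
from datetime import datetime

# Two records that describe the same person in different formats.
raw = [
    {"email": " Ada@Example.COM ", "joined": "01/17/2023"},
    {"email": "ada@example.com",   "joined": "2023-01-17"},
]

def cleanse(record):
    """Normalize values so that equivalent records compare equal."""
    email = record["email"].strip().lower()
    joined = record["joined"]
    if "/" in joined:  # normalize US-style dates to ISO 8601
        joined = datetime.strptime(joined, "%m/%d/%Y").date().isoformat()
    return {"email": email, "joined": joined}

# After cleansing, deduplication is a simple set operation.
cleaned = {tuple(sorted(cleanse(r).items())) for r in raw}
print(len(cleaned))  # 1
```

Without the normalization pass, both records would survive deduplication and the knowledge graph would contain two nodes for one person.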

3 – Transform data into a graph-friendly format: Once data has been cleansed and standardized, it must be transformed into a graph-friendly format. This involves mapping the data to a graph schema that defines the entities and relationships that will be represented in the knowledge graph. The schema should be designed to support the business use cases and goals of the project, and it should be flexible enough to accommodate changes and additions as the knowledge graph evolves over time.

4 – Map data to graph schema: After the schema has been defined, data must be mapped to the schema to create the knowledge graph. This involves identifying the entities and relationships in the data and creating nodes and edges in the graph that represent them. The process of mapping data to a graph schema can be automated to some extent, but it often requires human input and expertise to ensure that the resulting graph accurately reflects the data.

5 – Validate and refine the knowledge graph: Once the knowledge graph has been created, it must be validated and refined to ensure that it accurately represents the data and supports the business use cases of the project. This involves testing the graph against various scenarios and use cases, refining the schema and data mappings as needed, and incorporating feedback from stakeholders and users.

The Game-Changer: AI

Most data transformation projects are costly and time-consuming – these same barriers exist for any knowledge graph initiative. While we certainly need to address the above challenges (data cleanliness, interoperability, structure, and standardization), the data engineering required can carry quite the price tag. 

Fluree Sense automates these processes, using machine learning and AI to find the patterns inherent in data and map it across multiple ontologies. Fluree Sense transforms data silos into structured, semantic data assets that are optimized for knowledge graphs. With Fluree Sense, you can automatically transform your legacy data into a format that is compatible with your enterprise knowledge graph. 

The Fluree Sense Process

The End Result

By using Fluree Sense to prepare your legacy data for an enterprise knowledge graph, you can streamline the data preparation process and ensure that your knowledge graph is built on a solid foundation of reliable and accurate data. Data is now semantically described in multiple ontologies, and can therefore be accessed by many users within and outside a company’s four walls, based on whichever vocabulary they are comfortable interacting in. Data is also saved in RDF-friendly form so that it can be loaded into knowledge graphs, enabling users to analyze and introspect the data with queries more powerful than traditional SQL database queries alone.

With Fluree Sense, you get best-in-class data cleansing technology that is business user-friendly.

Is Data Cleansing the Answer to my Data Problems? 

A thousand times, no. Data Cleansing is a great way to get enterprise data into a usable state, but it does not address the fundamental problem that enterprises must address: their source data is, by nature, siloed. 

The ideal scenario is that the necessity for data cleansing diminishes over time, as the underlying reasons for data problems are addressed. Without tackling the fundamental issues of native interoperability, semantics, trust, quality, and security, we will only be applying temporary fixes to a convoluted and deeply ingrained architectural problem.

We cover the fundamentals of addressing each of these “data problems” in our data-centric architecture series. Read it here.

Data sharing isn’t just about sending links. Rather, it is about giving the right people (and machines) access to the right information at the right time. It’s about keeping an eye on what you’ve shared, and ceasing to share when necessary. Increasingly, data sharing is also about feeding new innovations, such as AI apps, high-quality data that will fuel transformative ways of living and working. 

Here are some of this year’s top trends in data sharing. Some build on last year’s trends, others are new. All represent an acceleration down the road of data centricity.

1. Privacy and Security Governance Across Data Lifecycle

For about two decades, companies have profited by collecting, using and selling user data. Collected anonymously, this data has long been dismissed by users as the price of keeping things like Google and social media free. 

Where there is treasure, however, there are pirates. If moral objections to the monetization of user data weren’t threatening enough, data theft certainly is. Every year, the Verizon Data Breach Investigations report comes out with new and terrifying figures. 2022 saw a 13% increase in ransomware, 80% of data breaches coming from external actors, and a slew of supply chain problems that were also hacker opportunities. 

Increasingly smart applications like facial recognition, combined with rising tensions between the world powers of the U.S., China, Russia, and Europe, mean that state actors also pose threats. Information warfare is, after all, an act of warfare, and it is in every country’s interest to protect user data. Europe’s GDPR is an early example of policy reacting to these threats. 

Our societal focus on data privacy and security is only heightening. How do you know if you’re sharing your personal data with that cloud app, and not with some guy on the dark web as well? Is a website’s encryption enough to really protect your data, or is it, like the Secure Sockets Layer (SSL) on almost every website, prone to attacks? In 2023 more than ever, the focus will be on access controls, advanced encryption, and transparency in where your data is going, when, and to whom. 

2. Decentralized data sharing

On one hand, decentralization is a response to the security and privacy threats floating around the internet. Decentralized systems are less vulnerable to cyberattacks and data breaches because the data is distributed across multiple nodes, making it harder for hackers to access all the data at once. Users also have more control over their data, deciding who can access which data and when, and tracing data provenance (where data goes, to whom, and when). The combination of control and transparency makes decentralization a natural protection against privacy and security threats.

On the other hand, data is becoming more decentralized as more organizations participate in cross-border data sharing ecosystems. Whether mandated by law, or to gain a competitive advantage, organizations are now having to work more efficiently with third parties, integrating shared, distributed data assets. 

These systems often use peer-to-peer architecture, where each node in the system can act as both a client and a server. Systems can handle a large amount of data and concurrent connections without becoming overwhelmed, and scale up or down without significant changes to their infrastructure. 

The ability to handle a lot of users and data without becoming overwhelmed makes decentralized systems eminently compatible with the challenges and opportunities of 2023. How do you become data-driven and secure? How do you use AI without bottlenecking your systems? How do you give users more control over their own data? Decentralization has technological answers to all of those modern quandaries. 

3. Verifiable credentials

Verifiable Credentials can potentially make credential data sharing ecosystems more trusted, secure, and efficient. 

Just as the Apple and Google wallets killed – or significantly slimmed down – the folding wallet, verifiable credentials are poised to significantly reduce dependencies on paper and plastic documents for identity or credential verification. 

From the standpoint of government agencies, title companies and anyone else who produces paper or plastic documents, verifiable credentials are inexpensive to produce and distribute. From the point of view of anyone who has ever had to send over a copy of their driver’s license, digitally sign a mortgage document, or otherwise replicate personal information for the sake of getting ahead in life, verifiable credentials offer a lot more privacy and autonomy, while also reducing costs on verifiers for automated review and approval. 
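
As a rough illustration of that verify-without-paperwork idea, here is a minimal Python sketch of issuing and checking a digitally signed credential. Note the hedges: real verifiable credentials (per the W3C data model) use asymmetric signatures, so anyone can verify against the issuer’s public key; the symmetric HMAC, the field names, and the `did:example` identifier below are simplified stand-ins for illustration only.

```python
import hashlib
import hmac
import json

# Stand-in for the issuer's signing key. A real issuer would use an
# asymmetric key pair and publish the public half for verifiers.
ISSUER_KEY = b"demo-issuer-secret"

def issue(claims):
    """Issue a credential: the claims plus a signature over them."""
    payload = json.dumps(claims, sort_keys=True).encode()
    proof = hmac.new(ISSUER_KEY, payload, hashlib.sha256).hexdigest()
    return {"claims": claims, "proof": proof}

def verify(credential):
    """Recompute the signature; any tampering with claims breaks it."""
    payload = json.dumps(credential["claims"], sort_keys=True).encode()
    expected = hmac.new(ISSUER_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, credential["proof"])

cred = issue({"holder": "did:example:123", "over_18": True})
print(verify(cred))  # True: the claim checks out without any paper document
```

Because the proof travels with the claims, a verifier can automate review and approval instead of eyeballing a photocopied driver’s license.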

4. Interoperability 

Data interoperability enables different systems to share and exchange information seamlessly. Think of your ability to control your home’s lights, garage door, TV and so on through an app like Apple Home. Even though each object has its own sensors, Apple Home makes it easy to see and control them from one place. You can almost feel the convenience. 

The same is true of any interoperable system. Right now, it’s hard to see doctors across different healthcare systems because you need to make specific requests to begin the transfer of your electronic health records (EHRs). If all of your EHRs sat in an Apple Home-like app, and you had complete control over how much of each record to share, with whom, and when, a big chunk of friction would be removed from life. 

Interoperability makes apps smarter and more convenient—the long-awaited realization of the internet being more like a virtual assistant than a cesspool of information.

Interoperability requires different engineering under the hood. For now, most data is siloed in separate systems, essentially growing stale with time. Interoperability requires teams to extract, transform and load (ETL) data from various siloed systems into a data warehouse before it can be investigated for insights and shared. 

Semantic interoperability that uses global standards cuts out the ETL process, enabling “zero-copy integration,” a nirvana for data sharing that reduces duplication and instead focuses on opening up “golden records” to permissioned parties. Different systems, applications, and data sources can be integrated without data engineering plumbing, leading to more opportunities for innovation and fewer opportunities for bottlenecks or attack surfaces. Moreover, interoperable data is more easily auditable and traceable, which makes it easier to ensure that data is accurate, complete, and compliant with regulations.
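
A toy sketch of the idea: two hypothetical systems with different local field names map onto one shared vocabulary, so their records become directly comparable with no per-pair ETL pipeline. The system names, field names, and vocabulary terms below are invented for illustration.

```python
# Each system declares, once, how its local fields map to a shared
# canonical vocabulary (here, invented schema.org-style terms).
CRM_MAPPING = {"cust_nm": "name", "cust_tel": "telephone"}
BILLING_MAPPING = {"client": "name", "phone_no": "telephone"}

def to_canonical(record, mapping):
    """Express a local record in the shared vocabulary."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

crm_rec = {"cust_nm": "Acme", "cust_tel": "555-0100"}
billing_rec = {"client": "Acme", "phone_no": "555-0100"}

# The two systems now agree without any bespoke CRM-to-billing pipeline.
assert to_canonical(crm_rec, CRM_MAPPING) == to_canonical(billing_rec, BILLING_MAPPING)
```

With N systems, each maintains one mapping to the shared vocabulary rather than N-1 pairwise transformations, which is where the engineering savings come from.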

Enterprises are already doing the heavy lifting to transform years of legacy data into interoperable data. Startups are focusing on interoperability while developing new products. 2023 will only see an acceleration of that trend.

5. Secure Multi-Party Computation

Multi-party computation (MPC) is a cryptographic technique that allows multiple parties to jointly compute a function over their private inputs without revealing those inputs to one another. This makes it a promising solution for secure data sharing, as it enables parties to collaborate on data analysis and processing while maintaining the privacy and confidentiality of their data.

MPC has gained a lot of attention in recent years, as organizations increasingly seek to share data with partners or third parties in a secure manner. For instance, in the healthcare sector, hospitals may want to collaborate on medical research while keeping patient data private. In the finance sector, multiple banks may want to share data to detect fraudulent transactions while maintaining the confidentiality of customer data.

MPC can also be used to enable secure voting and auctions, and it has applications in areas such as machine learning and blockchain technology.

One of the benefits of MPC is that it does not rely on a trusted third party to coordinate the computation. Instead, it uses cryptographic protocols to ensure that each party’s private input remains hidden from the others, while allowing them to collectively compute a result. This makes it a highly secure way of sharing data, as there is no central point of vulnerability that can be exploited.
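
To make the “no trusted third party” point concrete, here is a minimal sketch of additive secret sharing, one of the simplest MPC building blocks. The scenario (two hospitals jointly totaling patient counts) and all values are invented for illustration; production MPC protocols are far more involved.

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split a secret into n random-looking shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two hospitals each split a private patient count among three parties.
a_shares = share(120, 3)
b_shares = share(80, 3)

# Each party adds the shares it holds LOCALLY; no single share reveals
# anything, and only recombining all of them yields the aggregate.
sum_shares = [(x + y) % PRIME for x, y in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 200, without either input ever being exposed
```

There is no coordinator who ever sees 120 or 80; the "central point of vulnerability" simply does not exist in the protocol.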

However, MPC is not without its challenges. It can be computationally intensive and may require a significant amount of communication between the parties. In addition, the complexity of the protocols involved can make it difficult to implement correctly and securely.

Despite these challenges, the trend of MPC for secure data sharing is likely to continue. As more organizations seek to share data in a secure manner, MPC offers a highly promising solution that allows them to collaborate without compromising the privacy of their data. As the field of cryptography continues to advance, it is likely that MPC will become even more efficient and practical, making it an increasingly attractive option for secure data sharing.


To sum it all up, 2023 is the year when organizations can no longer ignore data. Whether building new lines of business, making legacy systems interoperable, or exploring the exponentially growing world of AI, it’s time to place data – and data sharing architectures – at the center of your strategy. 

Let’s face it: stumbling across an internet ad that’s relevant to your interests is rare these days. What’s the problem? Wouldn’t it make sense for an industry that has been so meticulously developed over the years to be spot on at catering ads to match consumer preferences? Well, despite the astronomical amount of data collection that exists in the multi-billion dollar digital advertising industry, Meta and Google dominate what is now considered a broken market – all due to the oversaturation of bots and middleman publishers that have thrown ad-matching algorithms off-kilter. 

Believe it or not, there’s a working solution to the problem! Fluree has partnered with Fabric to restore autonomy to consumer data and rebuild trust in an industry where skepticism is the norm.

So what is Fabric, exactly? 

Paul Taylor, Fabric founder and CEO, conceived of the company in 2017 with the angle of reinstating data ownership to consumers and eradicating the role of middleman publishers like Meta and Google that collect consumer data and sell it unsanctioned to advertisers. The goal was to disrupt the current advertising paradigm by requiring every consumer and advertiser to open a bank account that would prove their identity while creating transparency and trust between both parties. Taylor introduced an Ad Marketplace connected to these bank accounts, allowing users to watch ads and get paid by advertisers for doing so. Users willingly input select demographic information, watch ads from a variety of brands, and provide feedback on how they received each advertisement. They’re paid via a Fabric banking card that holds the funds earned from their time watching advertisements. In return, advertisers have access to high-quality, first-party targeting data from verified users that they can use without needing to incorporate a middleman publisher.

Where does Fluree come into play? 

Fabric uses Fluree technology to enable consumers to control and monetize their own personal data without being exploited by companies like Meta and Google. Fluree CEO Brian Platz says, “With Fluree’s trusted ledger database, Fabric has built a business that seamlessly helps to improve the relationship between regular people and the brands appealing to them. Fluree seeks to work with disruptive organizations looking to build new applications and services that make data more sovereign and business models more equitable.” Fluree’s blockchain-secured database ensures that advertisers know their consumers are real. This helps to eliminate ad fraud, which cost marketers upwards of $120 billion last year, while simultaneously cryptographically protecting consumer identities. Consumers are able to harness Fluree’s blockchain technology to sell their personal data to the advertisers that they choose. Fabric CEO and Founder Paul Taylor says that “Fluree’s unique data management platform unlocks new opportunities for startups like Fabric that are disrupting traditional business models… enabling Fabric to operate in a fraud free environment.”

What makes this partnership great? 

The collaboration between Fluree and Fabric is a game changer for the digital advertising industry. Fabric’s Ad Marketplace is one where consumers can own the value of their own data and advertisers can guarantee the direct value of that data as well. That Ad Marketplace is powered and made possible by Fluree’s unique data management infrastructure, where data can cryptographically prove its own integrity and provenance, and where collaborative access to data can be governed by policy at the data layer itself. Fabric and Fluree both believe that by empowering data with additional security, integrity, and trust, the value of that data can be unlocked for consumer and enterprise alike.

A larger trend

Fabric’s use of Fluree’s technology is a perfect example of the next generation of technologies powering what we call “Web3.” The Web3 movement is chipping away at traditional business models and restoring trusted data ownership and sharing that protect consumers and empower businesses with higher quality data. 

Web3, though widely misunderstood, is a generation of technology that we are still just stepping into. At Fluree, we believe Web3 rests on trusted data and semantic interoperability. Semantic interoperability refers to the ability of different systems and applications to understand and exchange data in a shared, meaningful way. It ensures that the data is accurate and consistent, regardless of the technology or system used to create or interpret it. By leveraging semantic interoperability, Fluree enables organizations to collaborate and share data more effectively, ultimately leading to better decision-making and business outcomes. This use case with Fabric is just the beginning of a larger trend we can expect to see, where more and more organizations harness the power of clean data to create a trusted and reliable product for consumers and vendors alike. We are excited to have set the standard for this trend and look forward to watching it progress as Web3 technology continues to develop. 

We’ve all heard of machine learning – a subset of Artificial Intelligence that can make inferences based on patterns in data. And if you’re involved in big data analytics or enterprise data management, you’ve also probably discovered the potential benefits of applying machine learning to automate certain components of the data management process. In the context of master data management, we are starting to see more use of machine learning to clean, de-duplicate, and “master” enterprise data. Today, we’ll cover the benefits of supervised machine learning with training sets before unleashing the algorithm on new data. We’ll also dive into the benefits of using human subject matter experts to apply additional feedback and training to machine learning algorithms.

What is Supervised Machine Learning?

Supervised machine learning is a type of model that is trained on labeled data, meaning that the data used to train the model includes the desired output or target variable. It is further reinforced by continuous feedback (think of “likes” or the “thumbs up” button).  The goal of supervised machine learning is to discover patterns in the input data that it was trained on, generalize the patterns, and then use those generalizations to make predictions on new, unseen or unlabeled data in the future. 
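
A minimal, dependency-free sketch of the idea, using a toy 1-nearest-neighbour classifier: the labeled examples are the supervision, and the model generalizes from them to classify data it has never seen. The features (string length, digit count) and labels below are invented purely for illustration.

```python
import math

# Labeled training data: each input has its desired output (the label).
# Features: (length of the value, number of digit characters in it).
train = [
    ((4, 0), "name"),
    ((5, 0), "name"),
    ((10, 10), "phone"),
    ((11, 9), "phone"),
]

def predict(features):
    """Classify a new, unseen input by its nearest labeled example."""
    nearest = min(train, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]

print(predict((12, 10)))  # an unseen input, classified as "phone"
```

The principle scales up: industrial models swap the toy distance rule for learned statistical patterns, but the train-on-labels, predict-on-new-data loop is the same.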

Supervised Machine Learning for Master Data Management

Supervised machine learning can be an effective approach for improving the efficiency and accuracy of master data management. The traditional method of Master Data Management involves (1) taking data from multiple data sources as inputs, (2) conforming them to the MDM’s native data model for processing the data, (3) building a set of “fuzzy logic” rules for matching together entities that belong together, and then (4) defining ‘survivorship’ rules for merging the data from the source systems into one Golden Record for each entity. 

These traditional methods of conforming the source data, defining the matching rules, and then defining the survivorship rules can be time-consuming as they are often discovered through trial-and-error and lots of experimentation.  They are also prone to failure as they cannot necessarily account for variability or significant changes to the data source inputs. Supervised machine learning, on the other hand, uses training data to recognize patterns and discover the logic for conforming, matching and merging the data – without needing analysts and engineers to manually hand-author the rules in advance.

In the context of Master Data Management, supervised machine learning can be used to (1) identify records which are of the same entity type; (2) cluster records and identify matches between them; and (3) figure out which data values to survive when creating the golden record.
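
As a sketch of step (3), here is one hypothetical survivorship rule – “most recently updated non-empty value wins” – applied in Python. Real survivorship logic is richer (source trust scores, per-field policies), and the field names and records below are invented for illustration.

```python
# Two duplicate customer records that have been matched as the same entity.
records = [
    {"name": "J. Smith", "phone": "", "updated": 1},
    {"name": "Jane Smith", "phone": "555-0100", "updated": 2},
]

def golden_record(dupes):
    """Merge duplicates: for each field, the freshest non-empty value survives.

    Assumes every field has at least one non-empty value among the duplicates.
    """
    fields = {k for r in dupes for k in r if k != "updated"}
    out = {}
    for f in fields:
        candidates = [r for r in dupes if r.get(f)]  # skip empty values
        out[f] = max(candidates, key=lambda r: r["updated"])[f]
    return out

print(golden_record(records))  # {'name': 'Jane Smith', 'phone': '555-0100'}
```

In a supervised setup, labeled examples of correct golden records let the model learn rules like this one instead of analysts hand-authoring them.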

For example, by providing some samples of labeled data of one entity type – say an Organization – machine learning-based data classification models can scan data from a variety of data sources and look for other tables and records of a similar Organization type.  Then, other machine learning algorithms can learn under what conditions records of the same entity should be matched together.  Is it if two entities have the same name?  What if the name has changed but the addresses are the same?  What if the record for one company has the name “Acme Inc.” but another record has a company with the name of “Acme LLC”?  Are they the same? 

Trying to work out all of the possible permutations of rules to discover when something is or isn’t a match can take forever, but for machine learning models that is quite easy to do.  And, depending on the amount of training provided, the resulting models can be more accurate and efficient compared to hand-written rules.
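
The kind of signal such a model might learn can be approximated by hand for illustration. The sketch below pairs simple legal-suffix normalization with string similarity (Python’s difflib); a learned matcher would discover patterns like this, and far subtler ones, from labeled match/no-match examples rather than from a hand-written list.

```python
from difflib import SequenceMatcher

# Hand-picked legal suffixes; a learned model would infer these from examples.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}

def normalize(name):
    """Lowercase, strip punctuation, and drop legal-entity suffixes."""
    tokens = name.lower().replace(".", "").replace(",", "").split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def match_score(a, b):
    """Similarity in [0, 1] between two normalized company names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(match_score("Acme Inc.", "Acme LLC"))  # 1.0: same company, different suffix
```

This answers the “Acme Inc.” vs. “Acme LLC” question for one known pattern; the value of the ML approach is covering the thousands of patterns nobody thought to encode.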

Key Benefits: Speed, Scalability, & Accuracy

Supervised machine learning can bring many benefits to master data management (MDM) by improving speed, scalability and accuracy of the data-mastering process.

Firstly, supervised machine learning can significantly speed up the process of getting an MDM program initiated.  Most MDM projects can take weeks to start, as Developers first have to do a bunch of ETL and data engineering to take the source data and conform it to the MDM system’s native data model.  Then, once the data is loaded into the MDM system, Developers can begin programming in the matching and survivorship rules, in collaboration with Business Analysts.  After iterative tuning and manually recalibrating the rules (e.g., “match the records together if the names match by 70% instead of 75%”), records across common entities can be clustered together to start generating Golden Records.  But this process may have already taken months before you see any results. By reducing the need for manual ETL and rule-tuning efforts, machine learning eliminates the most time-consuming part of data mastering and can start generating results within days or, at worst, weeks.

Secondly, supervised machine learning allows for greater scalability when dealing with both large amounts and a large variety of data sources. A traditional rules-based approach can be effective for a reasonable number of datasets, but when the volume or variety of data sources being mastered changes, the rules authored before may no longer apply.  Each new data source would then require its own data transformation, rule definition for matching, and rule definition for survivorship.  By contrast, machine learning processes get smarter the more data is fed into them.  Instead of the amount of time and effort scaling linearly with each new data source, as under a traditional rules-based MDM approach, the time and effort per source drops dramatically in a machine learning-based approach.

Lastly, supervised machine learning improves accuracy in data mastering. A rules-based data mastering solution can only match records that fit the exact conditions of the rules, and developers have to define the precise conditions of those rules (e.g., “these two entities can be matched if the names match by 50% and the addresses match by 60%”). However, most rules are only 80-90% accurate as defined; at very large data volumes, even a 5% error rate can be a big deal!  With machine learning, patterns with much richer features than the hand-written rules can be discovered, improving accuracy to as high as 98-99%. This improves trust in data and enables informed business decisions.

Supervised Machine Learning in Action: Comparing Time-To-Value in Customer Data Mastering

Let’s explore a scenario in the financial services industry where Supervised Machine Learning will significantly reduce overhead and time-to-value in data mastering. In this scenario, a bank needs to maintain a master list of all its customers, which includes information such as name, address, phone number, and account information. The bank receives customer data from multiple sources, including online applications, branch visits, and third-party providers.

A diagram showing supervised machine learning in action. Supervised machine learning is used to scan data inputs from multiple sources and create a golden record of the most accurate and correct information.

Without machine learning, the bank would have to manually review and cleanse the data, which would be a time-consuming and costly process. The bank would have to manually identify duplicates, correct errors, and standardize the data. This process would require a large number of data analysts and would take several months to complete.

However, by using supervised machine learning, the bank can automate the data cleansing process. The bank can provide the machine learning model with a set of labeled data, which includes examples of correct and incorrect data. The model can then learn to identify patterns in the data and make predictions about the data it has not seen before. The bank can also use the model to identify and merge duplicate records, correct errors, and standardize the data.

In this scenario, supervised machine learning can significantly reduce the cost and effort required to maintain a master list of customers. The bank can reduce the number of data analysts required and complete the data cleansing process in a fraction of the time it would take using traditional methods. This allows the bank to focus on more important tasks and make better use of its resources.

Supervised Machine Learning + Human Subject Matter Experts = Data Mastering ‘Nirvana’

Pairing supervised machine learning with a continuous process of “crowdsourcing” feedback from human subject matter experts can bring additional benefits to data mastering. Subject matter experts have a deep understanding of the data and the business context in which it is used. They can provide valuable insights into the data and help identify patterns and relationships that may be difficult for the machine-learning model to detect on its own.

The image reflects how supervised machine learning plus human subject matter experts leads to a data mastering nirvana.

One of the main benefits of pairing machine learning with subject matter experts is the ability to improve the accuracy of the model. Subject matter experts can provide the machine learning model with labeled data that is accurate and representative of real-world data. This can help the model learn to make more accurate predictions and reduce the number of errors. Additionally, subject matter experts can also help identify and correct errors made by the model during the training process, which further improves the accuracy of the model.

Another benefit of pairing machine learning with subject matter experts is the ability to improve the interpretability of the model. Machine learning models can be opaque and difficult to understand, which can make it challenging to trust the results and take action on them. Subject matter experts can help explain the model’s predictions and provide context to the results, which increases the transparency and interpretability of the model.

Finally, involving subject matter experts in the training process helps to improve the acceptance and adoption of the machine learning solution among the users. Subject matter experts can act as a bridge between the data science team and the end users, communicating the value of the solution and addressing any concerns that may arise. It also allows experts to set the right expectations and ensure the solution addresses real business needs.

Fluree Sense for Automated Data Mastering

Fluree Sense uses supervised machine learning trained by subject matter experts to ingest, classify, and remediate disparate legacy data. Fluree Sense is perfect for getting your existing legacy data into shape. Fluree Sense uses supervised machine learning to automatically identify and correct errors, merge duplicate records, and standardize data. The machine learning model is trained by human subject matter experts who provide labeled data that is accurate and representative of real-world data. This ensures that the model is able to make accurate predictions and reduce errors.

The image outlines the Fluree Sense advantage, showing a timeline of the fluree sense process compared to typical data management approaches. The fluree sense process generally spans over 8 weeks, led by 2 business analysts and two data engineers, at around 60 thousand dollars in labor cost. A traditional and typical approach spans over 30 weeks, using 3 business analysts and 6-8 data engineers, at around 650 thousand dollars in labor costs.

Fluree Sense offers an interactive interface that allows for easy monitoring of the data cleansing process and allows for real-time feedback and adjustments by human subject matter experts. This ensures that the data cleansing process is fully transparent and interpretable, which increases trust in the data and enables informed business decisions. Don’t wait to transform your data, start your Fluree Sense journey here!

What is Dirty Data? 

Simply put, dirty data is information that is incorrect, outdated, inconsistent, or incomplete. Dirty data may take the simple form of a spelling error or an invalid date, or a complex one such as inconsistency because some systems are reporting one set of facts that may have been updated more recently somewhere else. In all cases, dirty data can have massive negative consequences on daily operations or data analytics. Without good data, how are we supposed to make good decisions? 

How Common is Dirty Data? 

The short answer is: very common. In fact, it’s systemic. Experian recently reported that on average, U.S. organizations believe 32 percent of their data is inaccurate. The correlated impact is equally staggering: in the US alone, bad data costs businesses $3 Trillion Per Year

And yet, most organizations believe that data is a strategic asset and that extracting the value out of data is one of the most strategic business imperatives today.  So, how did we get here where something so valuable is still so hard to clean and maintain? 

Dirty Data’s Origin Story

The reason that Dirty Data is systemic is that its creation is baked into today’s business-as-usual processes for how data is being used.  And this is because most companies’ relationship with data has evolved, but the data management processes and technologies have not kept up with the change.

Data is the by-product of a business function

Data was originally the by-product of a business function, such as Sales or Order Fulfillment.  Each business function focused on optimizing its business processes through digitization, process automation, and software.  Business applications produced the data that were needed to operate a process, in a data model that was best understood by the application.  In the beginning, there may have been some manual data entry errors here or there, but by and large most business functions maintained the data integrity and quality at the levels required for it to successfully execute its operations.

However, with the introduction of the Internet and Big Data over 20 years ago, we learned that data has extraordinary intrinsic value beyond operating individual business functions.  When data from across functions, product lines, and geographies was correlated, combined, mixed, and merged along with other external data, it became possible to generate new insights that could lead to innovations in customer experience, product design, pricing, sales, and service delivery.  Data analytics emerged as the hot buzzword and companies began investing in their analytic infrastructure.  But there were a few inherent problems with how data analytics was being done.

Dirty Data – A Product of Its Environment

At its core – Dirty Data was the unintended consequence of two factors:

1. To be of maximum value, data needs to be used beyond its original intended purpose, by actors in an enterprise outside of the originating business function.

2. Each time we use data, we are forced to copy and transform it.  The more we copy and transform, the more we create opportunities for data to lose integrity and quality.

Many organizations have begun adopting Data Governance processes to grapple with the viral growth and transmission of Dirty Data.  These include processes for managing what should be the authoritative sources for each type of data; collecting and managing metadata (data about the data itself, such as what it describes, who created it, and where it comes from); and implementing data quality measurement and remediation processes.

However, in order to prioritize, most data governance processes depend on first identifying Critical Data Elements (or CDEs), which are the attributes of data that are most important to certain business functions, and then wrapping data policy and control around those.  While it makes sense to focus first on the key data features that are most useful, this still defines what is critical based only on what we know today.  As we’ve seen, what is not considered critical today could become the most valuable asset in the future.  This means that we need to evolve beyond data governance through CDEs if we truly want to eliminate Dirty Data from existence.

How do we move forward?

First, we must enhance and accelerate the data governance activities that have been proven to clean existing dirty data. In other words, stop the bleeding. But then, we need to address Dirty Data at its root by rebuilding the underlying data infrastructure to do away with the need to copy and transform data in order for it to be used across functions. Let’s explore each step further:

Step 1: Clean up your existing dirty data

1. Move Data Literacy and Data Management to the Front (to the Business)

One of the big challenges in most data governance processes is that they are considered IT problems. IT Developers engineered transformation scripts and data pipelines to copy and move data all over the place, and IT Developers are being asked to create more scripts and build new pipelines to remediate data quality issues. However, most IT Developers don’t actually understand the full context of how the data is being used, or what the terms really mean. Only the individuals at the front of the house, who build products or interact with customers, know for sure what the data means.

Business domain subject experts can quickly tell just by looking at it whether certain data is clean or dirty, based on institutional knowledge built over years of effectively doing their jobs. The only challenge is that this knowledge is stuck in their heads. Governance processes that can effectively institutionalize business subject matter expert knowledge are the ones that will be most successful.  This can be accomplished by making data literacy and data accountability part of the core business culture, implementing data management procedures with business domain experts in key roles, and supplementing this with techniques and technologies such as Knowledge Graphs and crowdsourcing, that can collect and store business knowledge in a highly collaborative and reusable fashion.

2. Standardize Data Using a Universal Vocabulary

Because we need to transform data for each use case based on that use case’s target consumption model, another way to stop the bleeding is to start building a reusable and comprehensive vocabulary or business data glossary that is designed to address most use case needs, now and as best as we can anticipate the future.  Creating one consistent canonical semantic model inside an organization reduces the need for continuously creating new consumption models and then creating new ETL scripts and pipelines.

Organizations can accomplish this by building a framework for semantic interoperability – using standards to define information and relationships under a universal vocabulary. Read more on why semantic interoperability matters here.

3. Apply Machine Learning Where it Makes Sense

Let’s face it – most data-cleaning projects can snowball into massive undertakings, especially with increased data complexity or obscurities that simply cannot be solved manually at scale.

Machine learning can outpace the standard ETL pipeline process by applying an algorithmic process to data cleansing and enrichment. A machine-learning approach to data transformation embraces complexity and will get better and faster over time. With machine learning, we can break past the subjective limitations of CDEs and treat any and all data attributes as if they are important.  It’s important to pair crowdsourced knowledge from business Subject Matter Experts (#1) with machine learning to generate the most accurate outcomes the quickest. 

Step 2: Get rid of dirty data for good

To get rid of dirty data for good, you must take a step back and apply transformation at the data infrastructure level. 

Too many data governance and quality investments focus on cleaning already dirty data, while the very sources and processes that create the environment for dirty data remain untouched. Much like the difference between preventative healthcare and treatment-centered healthcare, the opportunity costs and business risks associated with being reactive to dirty data rather than proactive in changing the environment for data are enormous. 

To proactively address dirty data at its root, we need to go back to the core architecture for how data is created and stored: the Application-centric architecture. In an Application-centric architecture, data is stored as the third tier of the application stack, in the application’s native vocabulary. Data is a by-product of the application and is inherently siloed.

Emerging new data-centric architectures flip the equation by putting data in the middle, and bringing the business functions to the data rather than copying data and moving it to the function.  This new design pattern acknowledges data’s valuable and versatile role in the larger enterprise and industry ecosystem and treats information as the core asset to enterprise architectures.

Take a step back from building more data-cleaning silos and pipelines, and look to best practices in data-centric architectures that produce future-proofed “golden records” out of the gate that can be applied to any number of operational or analytical applications.

There are foundational concepts to data-centric architecture that must be embraced to truly uproot dirty data (and their silos) for good – Security, Trust, Interoperability, and Time – that allow data to become instantly understandable and virtualized across domains. 

Ready to get started? 

Fluree’s data-centric suite of tools provides organizations with a path forward for better, cleaner data management: 

Fluree Sense is the first step in your data transformation journey. Fluree Sense uses machine learning trained by subject matter experts to ingest, classify, and remediate disparate legacy data. Fluree Sense is perfect for getting your existing legacy data into shape.

Fluree Core is your second step. Fluree Core provides a data management platform that enables true data-centric collaboration between any number of data sources and data consumers. Fluree’s core data platform is an open-source graph database that leverages distributed ledger technology for trust and data lineage, data-centric security for data governance that scales, and semantic interoperability for data exchange and federation across domains. Fluree Sense can feed directly into Fluree Core, allowing your business domains to collaborate directly within your enterprise-wide data ecosystem.

** Special Thanks to Eliud Polanco for Co-Authoring this post

It was fall 2008, and Eliud Polanco had a problem. The financial crisis had just hit. Polanco, a data and analytics expert, was sifting through endless Excel spreadsheets and mainframe printouts to figure out the risk exposure of the large, global universal bank where he worked. The spreadsheets contained information about the financial products sold and traded, as well as risk forecasts and models produced by quants—quantitative analysts who wrote code for pricing, high-speed trading, and profit maximization. Not only was the data within the mainframe reports and spreadsheets complex, it was also spread out around the world.

This was one of the largest banking institutions in the world, with offices in more than 100 countries and hundreds of thousands of employees. Branches, departments, and countries had their own systems, designed to make day-to-day life easier for the functions people worked in. When you zoomed out and tried to look at all those systems as a whole, however, things grew chaotic.

One of Polanco’s first steps was to put everything in a data warehouse, which then evolved into one of the largest data lakes at the time. He then methodically tried to query and sort the data using every analysis tool on the market. All of them came up short. 

It wasn’t terribly surprising. Master data management, the ability to view, manage, and analyze all data from a single pane of glass, has always been an elusive goal. Big, complex organizations have data in an array of siloes, from cloud applications like Salesforce to custom-built, department-specific ERPs. Trying to work with all the data in one place requires automation, and that, in turn, leads to unanticipated consequences. Software updates changing data is a common example. Approximately 62 billion hours of data and analytic work “are lost annually worldwide due to analytic inefficiencies,” according to the Data and Analytics in a Digital-First World report.

Polanco decided to build his own master data management platform. The software would help him understand not only the big picture of the organization’s data, but also let him dig into specifics without getting mired in bug fixes. Wall Street has long been a first mover in AI and machine-learning applications, and Polanco brought that innovative mindset to his platform, using unsupervised machine learning to crawl data and supervised machine learning to ask humans the right questions about that data.

The result was ZettaLabs, now Fluree Sense. By taking advantage of two kinds of machine learning, Fluree Sense organizes even the most diverse and chaotic data to 90th-percentile accuracy. When you send a query, an unsupervised machine learning algorithm crawls data sets in your data lake and aggregates potential answers. Next, a supervised machine learning algorithm performs entity resolution, grouping together names, addresses, and so on that seem to refer to the same person or object. The algorithm then creates questions for humans to answer and fires them off to subject matter experts. The experts respond, the algorithm learns, and Fluree Sense serves up your data on a color-coded plate.
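The human-in-the-loop step can be sketched as a triage rule: auto-accept clear matches, auto-reject clear non-matches, and route the ambiguous middle to a subject matter expert. The similarity measure, thresholds, and sample records below are invented for illustration and are not Fluree Sense's actual model:

```python
from difflib import SequenceMatcher

def score(a: str, b: str) -> float:
    """Stand-in for a learned match score between two records."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(a: str, b: str, auto_match: float = 0.9, auto_reject: float = 0.5) -> str:
    """Auto-decide the clear cases; route the ambiguous middle to an SME."""
    s = score(a, b)
    if s >= auto_match:
        return "match"
    if s < auto_reject:
        return "distinct"
    return "ask_expert"

pairs = [
    ("J. Smith, 12 Oak St", "J Smith, 12 Oak St"),
    ("J. Smith, 12 Oak St", "Jane Smith, 12 Oak Street"),
    ("J. Smith, 12 Oak St", "R. Jones, 99 Elm Ave"),
]
print([triage(a, b) for a, b in pairs])
# ['match', 'ask_expert', 'distinct']
```

Each expert answer to an "ask_expert" case becomes labeled training data, which is how the supervised side improves over time.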

[Image: the Fluree Sense pipeline]
[Image: entity resolution results from running Fluree Sense]

In a company as complex as the large global bank, ZettaLabs (now Fluree Sense) was a boon. Polanco was able to complete the seemingly impossible organization of quant data. He also realized that many other big organizations had chaotic data sets, and built a company around his software. ZettaLabs went on to be used for fraud and money-laundering risk detection as well as for customer-facing purposes, such as new customer acquisition, upsell/cross-sell opportunities, and customer delinquency/churn.

Fluree Acquires ZettaLabs

Fluree has long known that organizations are going from data as a byproduct—where it is stuck in cloud instances and other siloes—to data being the product. Those organizations that can leverage data effectively will come out ahead in the transition to Web3. Designed for Web3, Fluree lets users wrap policy around data for permissions, allows machines to collaborate around data, and lets users time travel to verify and validate data at different moments in time, among other data-centric features.

An ecosystem of startups and enterprise and governmental pilot programs flourished on Fluree. Big, complex organizations, however, still had their mess of legacy data to untangle. ZettaLabs was designed to do exactly that, providing a bridge between legacy data infrastructure and data-centric Web3.

“Dealing with legacy infrastructure is one of the biggest challenges for modern businesses, but nearly 74% of organizations are failing to complete legacy data migration projects today due to inefficient tooling and a lack of interoperability,” said Fluree CEO and Co-Founder Brian Platz. “By adding the ZettaLabs team and product suite to our own, Fluree is poised to help organizations on their data infrastructure transformation journeys by uniquely addressing all major aspects of migration and integration: security, governance, and semantic interoperability.”

“We developed our flagship product, ZettaSense, to ingest, classify, resolve and cleanse big data coming from a variety of sources,” said Eliud Polanco, co-founder and CEO of ZettaLabs, who will become Fluree’s president. “The problem is that the underlying data technical architecture—with multiple operational data stores, warehouses and lakes, now spreading out across multiple clouds—is continuing to grow in complexity. Now with Fluree, our shared customer base and any new customers can evolve to a modern and elegant data-centric infrastructure that will allow them to more efficiently and effectively share cleansed data both inside and outside its organizational borders.”

A Golden Record for Data 

Digital transformation is a journey that often takes multiple years. Cleaning data by collecting, integrating, reconciling, and remediating it across the organization is the gargantuan first step—and a massive technical challenge. Most big organizations have multiple databases and operational data stores, many of them containing low-quality data. Even after purchasing data warehouses and governance tools, organizations find their analytics stymied by poor data quality.

Then there are cultural challenges. Business units often have their own data stores that are custom configured, often in SaaS software such as a Salesforce instance. Merging that instance into an organization-wide process and workflow is an interruption to daily life—and threatens the KPIs of that business unit. Imagine you’re a salesperson with a quota to reach, and IT comes in wanting to reconfigure your Salesforce for a few months. The resistance is understandable but also delays digital transformation.

Fluree Sense solves the problem by getting data activation-ready within weeks, not months. The machine learning algorithm crawls data lakes, automatically integrating and cleansing data. The combination of supervised and unsupervised ML quality-assures 90% of the data, leaving a far smaller human workload. Cleaned and quality-assured data becomes available to the entire organization, accelerating the move to modern data-centric architecture.

With Fluree Sense, you can: 

For a large enterprise, data migration can cost up to $150 million. Fluree Sense can successfully operate at a small fraction of that cost. Organizations can consolidate business data from various stores—sometimes dozens of warehouses—into a single cloud data lake. Fluree Sense organizes and cleans it in one place, with a small team using low- and no-code software to manage the data. Time-to-value from raw data to business consumption is reduced from months to weeks. Data scientists, analysts, and business users can access the data through tools such as Tableau, Synapse, and Databricks – or, in the near future, a secure Fluree Core instance for a true ASOT (authoritative source of truth).

There is also an easy path to Fluree’s Web3 features, which include data audit and compliance solutions, verifiable credentials, data-centric applications, decentralized apps, and enterprise knowledge graphs. By making data usable and freeing it to be applied in Web3 architectures, Fluree Sense is shortening the timeline to digital transformation and making even the most chaotic data architectures Web3-ready. 

To learn more about the Fluree Sense product, sign up for our webinar: Introduction to Fluree Sense.

Grocery store shopping is blissfully easy. The butcher, the produce stand, and the bakery are all in the same convenient place. You can trust what the store provides. You can enter the store in a rush, browse through the meat aisle, grab a sirloin, make sure the expiration date is okay, and drive home.

What if you were to learn that the sirloin was adulterated by mixing beef with horsemeat? Or that the company packaging the meat had slapped a fake expiration date on it? You would understandably feel disgusted. Food should be safe and honest; a label should mean what it says. 

Most countries have traceability protocols, but they are not created equal. Even the most stringent rules, such as the US’s FSMA food safety standards, don’t necessarily keep up with ever-changing consumer preferences. Was that palm oil sourced sustainably? Is the beef kosher or halal? Too often, there is no way to verify if labels are telling the truth. 

For example, the US has a known beef traceability problem: just this year, the NFU (National Farmers Union) raised concerns that beef labeled “Product of the USA” might actually have come from foreign countries.

Establishing Trust in a Food Supply Chain

“We’ve seen what people try to get away with,” said Jonah Lau, Co-founder and CTO of supply chain-tracing startup Sinisana Technologies. “They modify things like expiration dates, or they mix in horse- or buffalo meat at the supermarket. We created Sinisana to solve those pain points.”

Sinisana is based in Malaysia, the world’s largest producer of halal food. The company recently launched Southeast Asia’s first halal beef tracking system, entering a global market of more than 2 billion Muslims. Sinisana uses Fluree to power a farm-to-fork traceability platform that ensures visibility through an entire beef supply chain. 

There is a technical reason that things like horsemeat can sneak into the beef supply. “It’s common for government food safety agencies to require documentation from one supply chain node to the next—for example, from slaughterhouse to processing plant—but not along multiple nodes,” said Lau. “Our long visibility into the supply chain is unique.”

Halal, an Islamic food law, requires adherents to only eat meats that are permitted (pork, for example, is not). It also sets down laws for what is essentially the meat supply chain, from slaughter to processing and storage. Even if animals are slaughtered in a halal-compliant way, transgressions can sneak in further down the supply chain. 

Concerned about this tendency, one of Malaysia’s most reputable halal beef suppliers approached Sinisana to establish supply chain transparency. The supplier sources cattle from Australia, ships them to Malaysia for fattening, then transports them for butchering. Cattle are butchered at various locations that include meatpacking plants and grocery stores. 

The supply chain, in other words, is not necessarily linear. It sometimes looks more like a tree than a single chain. Thanks to Sinisana and Fluree, however, consumers can trace the origins and movements of every single package of beef via a non-fungible tag with a QR code. 

Immutable Data, Cost-Efficient Pricing, and Simplicity

From the originator onwards, every organization that processes the cattle enters its own data—and cannot change previously entered data. As cattle move from producer to fattener to butcher, all information is preserved, making the data immutable. 

Even as cattle make their way to various locations for slaughter, Sinisana can still see and track data. This is because Fluree natively supports semantic graph technology, enabling full supply-chain visibility even with multiple parties involved. 

 “The great thing about graph is you have context for relationships, so you can go any direction you want,” said Lau. “It makes so much sense for us to run on Fluree. We can integrate all kinds of things that would be very difficult on a traditional blockchain.”
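As a loose illustration of that upstream visibility, the sketch below walks a hypothetical, simplified provenance graph from a retail package back to its origin. The node names and data shape are invented for the example and are not Sinisana's actual model:

```python
# Each entry records that an item came from the given upstream node.
# A real supply chain can branch like a tree; this sketch keeps one
# linear path for clarity.
edges = {
    "package-001": "butcher-kl",
    "butcher-kl": "feedlot-my",
    "feedlot-my": "farm-au",
}

def trace(item: str, graph: dict) -> list:
    """Walk upstream from a retail package back to its origin."""
    path = [item]
    while path[-1] in graph:
        path.append(graph[path[-1]])
    return path

print(trace("package-001", edges))
# ['package-001', 'butcher-kl', 'feedlot-my', 'farm-au']
```

Because the relationships carry context in both directions, the same graph could also be walked downstream, e.g. to find every package affected by a recall at one farm.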

Complex integrations are hard on a traditional blockchain in part because of gas fees. Transactions depend on a cryptocurrency such as ETH, so users must pay gas fees—prepaid estimates of compute needs—before every transaction. Fluree’s usage-based model does away with gas: instead, Fluree counts the number of flakes (discrete, semantically interoperable data units) that each transaction touches. Data storage and interoperability become cost-effective while keeping all the benefits of blockchains.

“Because Fluree enables a data mesh or fabric, you can do more things off-chain or synchronize side chains,” said Lau. “You aren’t doing as many critical things on the main chain that consume a lot of Gas. We are able to keep traceability costs low.”

Instead of passing along prohibitive gas fees, Sinisana is able to charge per unit or per kilogram of meat.
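To make the contrast concrete, here is a toy sketch of usage-based billing by flake count. The five-part tuple loosely mirrors Fluree's subject-predicate-object-time-operation flake shape, but the rate and arithmetic are invented for the example:

```python
COST_PER_FLAKE = 0.0001  # hypothetical per-flake rate, not a real price

# One transaction asserting three facts about a package: three flakes.
transaction = [
    ("package-001", "producedBy", "farm-au", 1001, True),
    ("package-001", "weightKg", 1.2, 1001, True),
    ("package-001", "halal", True, 1001, True),
]

def transaction_cost(flakes: list) -> float:
    # Usage-based billing: pay for the flakes the transaction touches,
    # not an up-front gas estimate of compute needs.
    return round(len(flakes) * COST_PER_FLAKE, 6)

print(transaction_cost(transaction))  # 0.0003
```

Cost scales with the data actually written, which is what lets a traceability provider price per unit of product rather than per unpredictable on-chain computation.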

Lau describes his stack as “light and intuitive.” Sinisana uses Fluree as a database back end, with Node.js, React, React Native, and Express as other layers. “I used to work with a different data fabric that required a lot of specific skills to build,” he said. “It took time to put together and ended up with many moving parts that could fail. That created complexity for no good reason. Oftentimes we didn’t have enough time to work on it while trying to go to market with our product—it was not an easy monster to manage. Fluree’s model, by contrast, is refreshingly simple.”

The Truth is in the Data

Sinisana’s low-cost pricing model and farm-to-fork technology continue to draw in other food producers. Suppliers of palm oil, crab, shrimp, and lobster are all working with Sinisana. Halal, however, is far from the only concern. There is a global movement toward greater visibility, particularly when it comes to ending bonded labor, a common employment practice that traps indebted workers with unfair wages. For its part, Sinisana is in talks with palm oil producers to promote transparency in the fair compensation of migrant laborers.

“Consumers are demanding sustainability, and the market is following,” said Lau. “We heartily believe supply chain tracking is the future. It’s hard to believe a company’s claims at face value. However, when one person takes the risk of applying full transparency to their supply chain, the rest are forced to catch up.” 

“The truth,” he said, “is in the data.”