Kevin Doubleday | 10.05.20

Introduction to Data-Centricity

The Data-Centric Architecture treats data as a valuable and versatile asset instead of an expensive afterthought. Data-centricity significantly simplifies security, integration, portability, and analysis while delivering faster insights across the entire data value chain. This post introduces the concept of data-centricity and lays the groundwork for the installments that follow.

Welcome to Fluree’s series on data-centricity. Over the next few months, we’ll peel back the layers of the data-centric architecture stack to explain its relevance to today’s enterprise landscape.

Data-centricity is a mindset as much as it is a technical architecture. At its core, data-centricity acknowledges data’s valuable and versatile role in the larger enterprise and industry ecosystem and treats information as the core asset of enterprise architecture. In contrast to the “application-centric” stack, a data-centric architecture is one where data exists independently of any single application and can empower a broad range of information stakeholders.

Freeing data from a single monolithic stack allows for greater opportunities in accelerating digital transformation: data can be more versatile, integrative, and available to those that need it. By baking core characteristics like security, interoperability, and portability directly into the data-tier, data-centricity dissolves the need to pay for proprietary middleware or maintain webs of custom APIs. Data-Centricity also allows enterprises to integrate disparate data sources with virtually no overhead and deliver data to its stakeholders with context and speed.

Data-Centric architectures have the power to alleviate pain points along the entire data value chain and build a truly secure and agile data ecosystem. But to understand these benefits, we must first understand the issues of “application-centricity” currently in place at the standard legacy-driven enterprise.

Big Data ≠ Valuable Data

The application boom of the ’90s led to increased front-office efficiencies but left behind a wasteland of data-as-a-byproduct. Most application developers were concerned with one thing: building a solution that worked. How the application’s data would be formatted or potentially reused was secondary, if it was considered at all.

Businesses quickly realized that their data has a value chain – an ecosystem of stakeholders that need permissioned access to enterprise information for business applications, data analysis, information compliance, and other categories of data collaboration. So, companies invested in building data lakes – essentially plopping large amounts of data, in its original format, into a repository for data scientists to spend some time cleansing and analyzing. But these tools simply became larger data silos, introducing even higher levels of complexity. 

In fact, 40% of a typical IT budget is spent simply on integrating data from disparate sources and silos. And integrating new data sources into warehouses can take weeks or months – which is a far cry from becoming truly “data-driven.”

In the application-centric framework, data originates in an application silo and trickles its way down the value chain with little context. Extracting value from this data is painful and expensive. Combining it with other data is nearly impossible. And delivering it to its rightful stakeholders runs into technical and bureaucratic roadblocks.

These are not controversial claims. According to an American Management Association survey, 83% of executives think their companies have silos, and 97% think it’s having a negative effect on business.

Let’s explore how these data silos continue to proliferate, even after the explosion of cloud computing and data lake solutions:

The Application-Centric Process

Today, developers build applications that, by nature, produce data. 

Middle and back-office engineers build systems to collect the resultant data and run it through analysis, typically in a data lake or warehouse. 

Data governance professionals work to ensure the data has integrity, adheres to compliance, and can be reused for maximum value. 

In order to re-use or share this data with third parties or another business app, it needs to go through processes of replication, cleansing, harmonization, and beyond to be usable. Potential attack surfaces are introduced at every level of data re-use. Complexity constrains the data pipeline with poor latency and poor context.

In other words, data is not armed at its core with the characteristics it needs to provide value to its many stakeholders — so we build specialized tools and processes around it to maximize its analytical value, prove compliance, share it with third parties, and operationalize it into new applications. This approach may have worked for ten or so years – but the data revolution is not slowing and these existing workaround tools cannot scale. Our standard ETL process is slow and expensive, and data lakes become data silos with their own sets of compliance, security, and redundancy issues.

But there is a better way – a path to data-centricity – that flips the standard enterprise architecture on its head and starts with an enterprise’s core asset: data. 

The Shift to Data-Centricity – Why Now?

Industries are moving towards data ecosystems – an integrative and collaborative approach to data management and sharing. Here are just a few examples:

  • Data-driven business applications today touch many internal and external stakeholders (sales, HR, marketing, analysis, compliance, security, customers, third parties, etc.). There is a clear need to collaborate more effectively and dynamically on data that powers multiple applications across multiple contexts. 
  • Enterprises are building (many for the very first time) a master data management platform for a 360-degree-level view of their master data assets. This can be as simple as building a “golden record” customer data repository to cut down on redundancies in data silos and implement more centralized data access rules. Next-generation MDM solutions are now making their “golden record” repositories operational – where their master data repositories directly power applications and analysis from the same source of truth. Data-centricity is essential here.
  • Enterprises are creating data “knowledge graphs” that link and leverage vast amounts of enterprise data under a common format for maximum data visibility, analytics, and reuse. 
  • More advanced enterprises are building “data fabrics,” a hyper-converged architecture that focuses on integrating data across enterprise infrastructures. Data Fabrics (in theory) provide streamlined and secure access to disparate data across cloud and on-prem deployments in an otherwise complex distributed network environment. 
  • Enterprises are realizing the value of “data marketplaces,” where “golden record” information can be subscribed to within a data-as-a-service framework. 

To accommodate these data-driven trends, we need to build frictionless pipelines to valuable data that is highly contextual, trusted, and interoperable. And we need to answer emerging questions around data such as: 

  • Data Ownership: Who owns the data, and how is privacy handled?
  • Data Integrity: How do we know the data has integrity?
  • Data Traceability: Who/When/How was it originally put into the system? How has that data changed over time, and who has accessed that data?
  • Data Access Permissions: Who should be able to access the data or metadata, and under what circumstances? How can we change those security rules dynamically?
  • Data Explainability: How do we trace back how machines arrived at specific data-driven decisions?
  • Data Insights: How can we organize our data to maximize value to its various stakeholders?
  • Data Interoperability: How do we make data natively interoperable with machines and applications that reside within and outside of our organization?

Fluree: The Data-Centric Stack

Fluree is a data management platform that extends the role of the traditional database to answer these above questions. Fluree breaks its core “data stack” into 5 key areas: trust, semantic interoperability, security, time, and sharing.

[Diagram: Fluree’s core architecture as nested rings, with Trust in the middle, followed by Semantics, Security, Time, and Shareability.]

Is Fluree a database, a blockchain, a content delivery network, or a more dynamic, operational data lake? 

It seems we could silo off Fluree as a technology to fill any of those roles, but its value is best realized in the greater “data value chain” context. Fluree’s data-centric features work together to enable the data environment of any CIO’s, CTO’s, or CDO’s dreams: secure data traceability and audit trails; an instant knowledge graph built on RDF semantic standards; blockchain for trusted collaboration; and a scalable graph database with in-memory capabilities to power intelligent, real-time apps and next-generation analysis.

But these concepts can feel overwhelming, especially for a business that has always worked in silos of data responsibility. So, we decided to break down each component of the data-centric stack in this five-part series. Check out part 1 on “Data-Centric Trust.”

[Graphic: the F.A.I.R. acronym and what each letter represents: Findable, Accessible, Interoperable, and Reusable.]


Introduction 

In 2016, Scientific Data published a paper titled The FAIR Guiding Principles for scientific data management and stewardship, a call to action and roadmap for better scientific data management. The mission was clear: give digital information the characteristics it needs to be findable, accessible, interoperable, and reusable by communities after initial data publication. FAIR principles allow data and metadata to power insights across their many stakeholders and evolve into dynamic sets of shared knowledge.

Bringing FAIR to the Enterprise

As noted in the original paper, FAIRness was born out of the data management pains of the scientific research community — specifically the need to build on top of existing knowledge and securely collaborate on research data. But the FAIR principles should now be considered for sophisticated master data management across industries, especially as enterprises invest heavily in extracting analytical and operational value from data.

Today’s enterprise data has multiple stakeholders – front-office applications that generate data, compliance departments, ERP systems, cybersecurity, back-office analysts, and emerging technology projects like AI, among many others. With this degree of layered collaboration, FAIR data principles should be implemented to extend data’s utility across the data value chain and provide an enterprise-wide source of truth.

Building FAIR principles directly into all data assets from the start might seem like an extreme upfront investment compared with the standard procedure of hiring data scientists to extract value from old data. But as semantic data expert Kurt Cagle states in a recent publication titled “Why You Don’t Need Data Scientists”:

“You have a data problem. Your data scientists have all kinds of useful tools to bring to the table, but without quality data, what they produce will be meaningless.” 

Kurt Cagle

FAIR data is about prescribing value to information at the source of creation, rather than as a messy afterthought of harmonizing, integrating, and cleansing data to be further operationalized in yet another silo.

With data integration, compliance, and security as the top items consuming the typical IT budget, perhaps starting with FAIR data is worth the upfront investment. And as superior Master Data Management becomes a competitive edge in 2020, FAIR data principles must be considered beyond the scope of the research industry.


So, what makes data “F.A.I.R.”?

FAIR data principles not only describe core standards for data and metadata, but also for tools, protocols, and workflows related to data management. Let’s dive in: 

(Thanks to go-FAIR: https://www.go-fair.org/fair-principles/)


Findable

Findability is a basic minimum requirement for good data management but is often overlooked when rushing to ship products. In order to operationalize data, humans and machines must first be able to find it — which comes down to using rich, schematically prescribed metadata to describe information so consumers know where and how to look (a minimal metadata sketch follows the list below).

  • Data has persistent identifiers  – A unique identifier for every data element is essential for humans and machines to understand the defined concepts of the data they are consuming. 
  • Rich metadata supports dataset – Provide as much supporting metadata as possible for future stakeholders to find it via linked data. Rich and extensive details about data might seem unnatural to the standard database programmer but can seriously boost findability downstream as your web of data assets grows. 
  • Metadata accurately describes dataset – A simple yet important rule for scaling knowledge: metadata’s relationship to a dataset should be defined and accurate.
  • Metadata is searchable – Metadata must be indexed correctly in order for data to be a searchable resource.
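
To make this concrete, here is a minimal sketch (in Python) of a dataset description with a persistent identifier and rich supporting metadata. The identifier, vocabulary, and field names are illustrative assumptions rather than a prescribed FAIR or Fluree schema.

import json

# Hypothetical metadata record for a dataset. The "@id" is a persistent,
# globally unique identifier; the remaining fields are the rich, descriptive
# metadata that make the dataset findable by humans and machines alike.
dataset_metadata = {
    "@id": "https://example.org/datasets/customer-orders-2020",
    "title": "Customer Orders, FY2020",
    "description": "Order-level sales records exported nightly from the ERP system.",
    "keywords": ["sales", "orders", "ERP", "customers"],
    "creator": "https://example.org/org/sales-operations",
    "issued": "2020-10-05",
}

# Indexing this document in a search system is what makes the metadata
# (and therefore the dataset) searchable.
print(json.dumps(dataset_metadata, indent=2))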

Accessible 

Once data can technically be found, it must also be accessible to its various stakeholders at the protocol layer, with data access controls baked in (a small retrieval sketch follows the list below).

  • Data uses a standard communication protocol – Data should be able to be retrieved without proprietary protocols. This makes data available to a wider range of consumers.
  • The protocol is open and free – Reinforcing the above, the protocol should be universally implementable and non-proprietary. (This may vary between open and closed enterprise environments, but most data must eventually scale beyond enterprise borders.)
  • Data authentication and permissions can be set – Data ownership, permissions, and controls should be comprehensively integrated into FAIR data strategies. Because data is technically accessible to anyone via an open protocol, a scalable identity and access management system is essential. (Note: Take a look at Fluree’s Data-Centric Security here!)
  • Metadata is always available, even without dataset – Metadata must persist even in the event that data is deleted in order to avoid broken links. Metadata is an incredibly valuable asset and should be treated as such.
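
As a rough illustration of these points, the sketch below retrieves a metadata record like the one above over a standard, open protocol (plain HTTP) with authentication layered on top. The URL, header values, and token are hypothetical placeholders, not a real endpoint.

import requests

# Hypothetical metadata endpoint reachable over a standard, non-proprietary protocol.
METADATA_URL = "https://example.org/datasets/customer-orders-2020/metadata"

response = requests.get(
    METADATA_URL,
    headers={
        "Accept": "application/json",              # ask for a standard representation
        "Authorization": "Bearer <access-token>",  # placeholder credential
    },
    timeout=10,
)

# Even if the underlying dataset were ever deleted, this metadata document
# should remain resolvable so that links to it never break.
print(response.status_code, response.json())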

Interoperable

Data must be primed for integration with other datasets and sources. It must also be consumable by any standards-based machine.

  • Data is easily exchanged and interpreted via standard vocabularies – Data must be consumable by both machines and humans, so it must be represented with a universally understood ontology and a well-defined data structure. For example, Fluree stores every piece of data and metadata in the W3C-standard RDF format – an atomic triple readily consumable by virtually any machine.
  • Datasets and metadata are all FAIR – The vocabularies that govern these principles must themselves adhere to FAIR principles.
  • Metadata is meaningfully linked to other metadata – To extend the value of linked data, relationships between metadata should be cross-referenced in rich detail. Instead of stating “x is associated with y,” one might state “x is the maintainer of y” (see the sketch after this list).
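
For example, the sketch below uses the open-source rdflib library (the namespace and resource names are made up for illustration) to express the “x is the maintainer of y” relationship as RDF triples and serialize them in Turtle.

from rdflib import Graph, Literal, Namespace

EX = Namespace("https://example.org/")

g = Graph()
g.bind("ex", EX)

# Instead of a vague "associated with" link, state the relationship precisely:
# Alice is the maintainer of the customer-orders dataset.
g.add((EX.alice, EX.maintainerOf, EX["customer-orders-2020"]))
g.add((EX["customer-orders-2020"], EX.title, Literal("Customer Orders, FY2020")))

# Serialize as Turtle, a standard, machine-readable exchange format.
print(g.serialize(format="turtle"))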

Reusable

The first three characteristics (FAI) combine with these last properties to ultimately make data reusable as an operational resource for apps, analytics, data sharing, and beyond.

  • Metadata is described using rich attributes – Data must be described with a plurality of labels, including rich descriptions of the context in which the data originated. This allows for data not only to be found (see F), but also understood in context to determine the nature of reuse.
  • Metadata is available under an open-usage license – In order to be re-used, data must come with descriptive metadata on its usage rights.
  • Historical provenance is associated with metadata – The full provenance and path of data should be explicitly published so that data and metadata can be reused with confidence. For example, Fluree automatically captures traceability and can output a complete audit trail of data – including details of origination and the path of changes, tied to unique user identities in an immutable chain of data events.
  • Data matches common community standards –  If relevant, data must adhere to domain-specific standards. For example, the FHIR standard for healthcare interoperability or GS1 standards for supply chain interoperability.


Taking the Fluree Approach

Fluree’s semantic graph database expresses information in the W3C-standard RDF format and extends metadata capabilities to comprehensively satisfy F.A.I.R. requirements.

To most people, databases are simply tools built to feed an application. But at Fluree, we are reimagining the role of the database to acknowledge functions beyond data storage and retrieval. Fluree is not just a read-only data lake, but also a transactional engine. In essence, Fluree’s data platform can build sets of FAIR data, enforce ongoing compliance with FAIR data standards through governance-as-code, and power consumers of FAIR data for read/write applications.

More on Fluree’s Data-Centric features here: https://flureestg.wpengine.com/why-fluree/

Further Reading on FAIR Data Principles: 

What if machines could talk to one another?

In 1999, the inventor of the World Wide Web, Sir Tim Berners-Lee, expressed a vision for an intelligent, connected, and data-driven Internet:

“I have a dream for the Web in which computers become capable of analyzing all the data on the Web — the content, links, and transactions between people and computers.”

Tim Berners-Lee

Berners-Lee’s vision became popular among emerging technologists who understood the value of universal data — interconnected, versatile, and directly available as a globally decentralized asset.

Core Philosophies of Web3 (The Semantic Web)

What does Web3 look like?

Web3 aligns with a data-first approach to building the future.



Baby Steps to Adoption

[Image: RDF as a standardized atomic unit of data.]

The Web3 vision never fully manifested as internet giants took control of and centralized the gold rush of data in the 2000s. It did, however, spark a fervent following around standardization (RDF, SPARQL, OWL) with the goal of making all data on the web ‘machine-readable.’

This led to linked enterprise data initiatives built on knowledge graph technology and connected-data research in healthcare, industry, and information science. Standardization also shaped the broader internet: much of the web’s metadata is expressed in RDF (serialized in XML) to describe data to machines such as search engines.

But the true semantic web — a universally decentralized platform in which all data were to be shared and reused across application, enterprise, and community boundaries — hasn’t taken complete form. And with emerging applications in machine learning and artificial intelligence, a semantic web of information readable by machines is the obvious next step.

So why haven’t we moved completely to a Web 3.0 framework?

Answer: Trust, at scale


The Semantic Web vision in many ways mirrors the rhetoric of today’s conversation around decentralization (most commonly as a defining mechanism and philosophy in cryptoeconomics). It lays out a powerful vision for cross-boundary collaboration, third-party transactional environments with no middlemen, and a ‘democratization’ of power. A true open-world concept.

However, in order to truly facilitate secure data collaboration across entities, the fundamental issue of trust became a massive hurdle. How are we expected to openly expose information to the world if it could easily be manipulated?

Tim Berners-Lee's layer cake of enabling Semantic Web standards and technologies.

In Tim Berners-Lee’s “layer cake” diagram above, the large box dedicated to “crypto” was never truly filled in, leaving unrealized the “proof” and “trust” layers that would bring the semantic dream to fruition.

Juxtaposed against massive companies centralizing data and beginning to use it as their own means of revenue generation, the semantic web stayed largely a vision with a very dedicated niche following.

Enter: Cryptography and Trust

In early 2009, Bitcoin introduced us to a very powerful concept: In Code We Trust. Via the combination of ordered cryptography and computational decentralization, Bitcoin showed the world that we could in fact inject trust into exposed information in an open transactional environment.

Immutability and tamper resistance, provided by advanced cryptography, became the centerpiece of discussion around blockchain’s applications in various industries. And in many ways, it had the power to close the trust gap standing between Web3 and mass adoption. A technology for securing, storing, and proving the provenance and integrity of information, combined with data standardization and semantic queries, would mark an incredible step toward a more intelligent web framework.

| Additional reading: Why Blockchain Immutability Matters

Still, in the early days of blockchain technology, the application focus tended to land heavily on recording transactions. Specifically, public chains such as Ethereum and Bitcoin were excellent means of accomplishing asset movement between parties.

The machine-readable web requires more: in order to query and leverage data as a readable and malleable asset to power applications, blockchain had to manage all data in a usable format for applications. At Fluree, this is what we called “Blockchain’s Data Problem.”

Blockchain’s Data Problem

“We looked at blockchain a few years ago, but ultimately found it too complex for us to work into our business architecture.”

Blockchain was originally designed to facilitate peer-to-peer banking at scale – which required minimal data management capabilities. But when enterprises began their blockchain journeys, they found it difficult to retrofit first-generation blockchain technology into their existing technology stacks — primarily because most enterprise applications require sophisticated data storage and retrieval. For example, a supply chain application produces and pulls data in a variety of ways: purchase orders, store IDs, RFID inputs, and more.

Building blockchain applications that handle metadata with this level of sophistication is challenging from both a development and an ongoing integration-management standpoint – and the overhead required to align this information and make it operational downstream for applications is nearly impossible to justify.

So, most folks just stick their data and metadata in a regular old database and use blockchain as yet another data silo. The lack of holistic data management defeats the original purpose by adding cost and complexity to the systems that this new technology was meant to simplify.

Enter: Fluree

Fluree solves this data integration problem as a ground-up data management technology, allowing developers to build a unified and immutable data set to power and scale their applications. No sticky integrations or extra layers — just one, queryable data store optimized for enterprise applications.

Fluree focuses on a blockchain-backed data management solution that brings cryptography and trust to every piece of data in your system.

By embracing semantic standards as a core component of storage (RDF) and retrieval (SPARQL, GraphQL), Fluree brings trust to the semantic web all under one scalable data management platform. 

Fluree’s architecture combines semantic standards, enterprise database capabilities, and blockchain characteristics to bring interoperability, trust, leverage, and freedom to data. It is truly the Web3 data stack:

[Diagram: Fluree’s architecture broken into its five main components, with Trust at the focal point, then Semantics, Security, Time, and Shareability.]


Watch More:

We don’t have to be industry analysts to know that better data equals better decisions. 

Today, we’ll look at 5 ways blockchain technology can improve enterprise data quality and data management to make a lasting impact on a company’s operational success.

1. Unified Master Data Management

Working towards a distributed, verified system of record for better master data management.

Good master data management is part technology and part strategy – it involves building a comprehensive data management framework wherein critical data is integrated and leveraged as a single point of reference across the digital enterprise.

But in many cases, the enterprise is riddled with data silos, resulting in duplication of records and inaccurate or incomplete sets of information. These visibility issues are multiplied with the complexity of new business activity; for example, a merger or acquisition where sales, marketing, HR, financial and customer data is inconsistent in format, duplicated, and needs serious integration processes. 

Blockchain technology can provide a single source of data truth for the data-driven enterprise – as one trusted and distributed system of record. We can crush data silos by logging master data on a distributed blockchain ledger and allowing our suite of applications to connect to and leverage this validated information. 

In cases of multiple stakeholders (i.e. separate applications, separate business units, or even separate corporate entities), a distributed blockchain allows credentials-based access management for secure collaboration on master data. If we can trust and verify the data at its operational layer, we can build frictionless data pipelines to its relevant consumers. 

2. Instant Data Compliance 

The auditors are back – and it’s time to scramble to reproduce financial reports and email threads. 

For many highly-regulated industries (Aerospace, Financial Services including Insurance and Banking, Healthcare), this scenario is a quarterly reality. And in all of these cases, the intended outcome is to reproduce data alongside a timestamp to prove compliance. 

Blockchain technology provides turnkey compliance to every record in a system: by cryptographically hashing every transaction to the prior transaction, a blockchain database provides a native audit trail of verified transactions. This allows you to mathematically prove the legitimacy of data, its original source, its timestamp, and its complete historical path of changes through time.

In fact, the Fluree system natively provides “Time Travel Query,” a powerful query capability that can reproduce — and prove — any state of the database down to the millisecond. It’s like a highly tamper-proof Git, but for all of your data.
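
As a rough sketch of what a time travel query can look like, the snippet below posts a FlureeQL query to Fluree’s HTTP API and pins the results to a specific block. The endpoint path, ledger name, and collection name are assumptions modeled on Fluree’s 0.x documentation, not a drop-in example.

import requests

# Hypothetical local Fluree 0.x instance and ledger.
QUERY_URL = "http://localhost:8080/fdb/example/audit/query"

# FlureeQL query: the "block" key pins the query to a past point in time,
# reproducing the ledger exactly as it stood at block 42.
query_as_of_block_42 = {
    "select": ["*"],
    "from": "invoice",   # illustrative collection name
    "block": 42,
}

response = requests.post(QUERY_URL, json=query_as_of_block_42, timeout=10)
print(response.json())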

3. Data Provenance and Replayability 

Where did that data come from? 

Sometimes databases are more like black boxes than sophisticated systems of record: update-and-replace with no notion of temporal state or original provenance. The average enterprise deploys 200+ SaaS apps – each with its own ‘black box.’ Furthermore, that ‘black box’ data is often replicated and ingested into new systems or data lakes for analysis and further operational application; visibility is eroded at every stage of reuse.  In many cases, our mission-critical applications are leveraging data that, quite frankly, cannot be traced back to its original source.

Blockchain brings end-to-end traceability to every piece of data in your system — a guaranteed chain-of-data-custody. This means you instantly have visibility to data’s origination (SaaS or user) and you can track its complete path of changes across its lifecycle. 

With Fluree, the traceability of changes is tied to digital signatures — including those of third-party applications that sit between clients and data. How’s that for visibility? 

4. Data Democratization Across Borders

Deliver trusted data across borders.

In today’s enterprise, information needs to traverse boundaries. Healthcare patient records need to be piped between carriers and hospitals. Government institutions need to securely share records between various agencies. Insurance carriers need to collaborate around claims data and contracts. Data is more useful and effective in the hands of many.

Blockchain technology can provide a trusted foundation for secure, real-time data collaboration across boundaries (application, organization, or even industry borders). With a unique structure of data ownership where various stakeholders have direct visibility into digital records, and even a vote on their validity, blockchain allows for every stakeholder across the data value chain to contribute to trusted sets of shared information. With real-time data visibility, stakeholders can more efficiently optimize business processes and make dynamic decisions.

5. Explainable AI; Trusted Machine Learning

If machines are going to be making more of our business decisions, shouldn’t they have access to trusted, tamper-proof data?

This particular use case might not be highly-relevant to enterprises that have not dipped the proverbial toe into production machine learning. But for those that have: the idea of explainable AI, or understanding the path of decision-making, is important. In extreme cases, like autonomous vehicles, explainable AI is essential. To bring back a metaphor from above, machine learning architects need a better way to explain results apart from opening the proverbial ‘black box.’

As more and more machine learning implementations make their way into the operational world, preserving data integrity and provable history will be a top priority; with more production deployments there will be more adversarial attacks.

Blockchain’s cryptography gives data tamper resistance and complete traceability into (a) how it got there and (b) its chain of updates through time. If we can arm the sources powering machine learning applications with this level of data integrity, replayability, and traceability, we can better understand causation in our models and protect them against “data poisoning” and other forms of data manipulation.

Fluree’s temporal dimension to data certainly provides this guarantee of instant and comprehensive replayability. Read more about time travel here.   

Conclusion 

At the end of the day, no matter which data management technology solutions you are deploying, better data equals better business decisions. Leaning into blockchain technology can increase the quality of your data and augment its value across the enterprise. 

By arming data with native traceability, interoperability, context, and security — Fluree is enabling a new class of data-driven applications. Read more here, or sign up for free here!  



It’s a rainy Sunday afternoon, and my friend Alice and I are playing “Where’s Waldo.”* This page is a particularly challenging one, and we’ve been looking for nearly twenty minutes. I’m about ready to give up when, finally, I spot Waldo, and I yell out, “I got him!”

“Oh really? Prove it then,” Alice says, hand on her hip.

I raise my finger, and I’m about to point to Waldo’s location, but she stops me, exclaiming, “Wait, wait, wait! Wait just one second! I’ve put so much work into finding Waldo, and I don’t want you to ruin it for me.”

“Hmm…” I say. “If only there was a way for me to prove to you that I know where Waldo is without revealing his location.”

“Here, I have an idea,” Alice says. “I’ll make a photocopy of this page in the book, and you can cut Waldo out of the photocopy and show him to me. That way, I know for sure that you know where Waldo is, AND I still won’t know where he is.”

As you might have guessed by the title of this blog, the scenario I’ve described is a classic example of a real-life zero knowledge proof. In this story (which **definitely** happened in real life), I am proving that I possess a piece of information (Waldo’s location), while at the same time not revealing that information.

There are many types and potential applications of zero-knowledge proofs, including nuclear disarmament (yes, really) and verified anonymous voting. In this example project, I’ve focused on how using Fluree in conjunction with zero-knowledge proofs can begin to tackle the challenge of traceable fishing. 

Before we begin, I need to make the disclaimer that I am, by no means, an expert in zero-knowledge proofs. I welcome and encourage comments, suggestions, corrections, and more in the comments. This project is an exploration of the possibility of using zero-knowledge proofs with Fluree.



With three billion people relying on fish as their main source of protein, oceans are a critical food source. Given the importance of marine life to human nourishment, efforts have been made to regulate fishing locations, methods, and quantity. These regulation attempts are sometimes at odds with the desires of fishers who, among other concerns, are not too keen to reveal their fishing locations. This is where zero-knowledge proofs can come in.  Zero-knowledge proofs can allow a fisher to prove that they are, for example, fishing within an allowed area, without exposing the exact location.

To follow along with this example, you can download the fluree/legal-fishing repository. If video is more your speed, you can also check out this project’s accompanying video above or on YouTube.

Iden3’s Circom and SnarkJS


In this project, I use two Javascript libraries, both published by Iden3, to handle my zero-knowledge proofs. I use circom to write and compile an arithmetic circuit (more on this in a second). I then use this circuit in conjunction with the snarkjs library to implement a zero-knowledge proof. (For those who are interested, the specific type of zero-knowledge proof we use here is called a zkSNARK, or zero-knowledge succinct non-interactive argument of knowledge, using the 8points protocol). 

The intuition behind this type of zero-knowledge proof is not as simple as the “Where’s Waldo” example, but it operates in a similar way. We first build a very specific type of electrical circuit that is only satisfied by inputs that match certain restrictions.


Using circom, we’ll create a representation of a circuit, but let’s imagine for a second that it’s a real-world circuit. You and I sit down, and we build an electrical circuit with 5 switches on one side. The circuit has a light bulb that will always light up if exactly 3 of the switches are on; it doesn’t matter which three.


I leave the room, and you flip any three of the switches that you want. You then cover up the switches with a cardboard box so that no one can see them. I walk into the room, see the light bulb is shining, and I know that the circuit is complete. I know that your input (which switches you turned on) fit our criteria (exactly 3 switches turned on), but I won’t know exactly what your input was.

This type of zero-knowledge proof is analogous to what we are mathematically accomplishing using circom and snarkjs. We won’t delve into the math here, but hopefully this has given you some intuition.
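
To stretch the analogy one step further, here is a toy (and decidedly non-cryptographic) Python stand-in for that five-switch circuit: the verifier only ever looks at the single output bit, never at which switches were flipped.

def light_bulb(switches):
    """The 'circuit': lights up (returns 1) only when exactly 3 of the 5 switches are on."""
    assert len(switches) == 5
    return int(sum(switches) == 3)

# The prover picks a secret combination of switches...
secret_switches = [1, 0, 1, 1, 0]

# ...and the verifier only observes the bulb.
print(light_bulb(secret_switches))  # 1 -> the constraint holds, but the combination stays hidden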

InRange.circom


The specific type of circuit we are creating is an arithmetic circuit, which is a circuit that can perform some arithmetic operations. If you are following along with the GitHub repository, the circuit is `src/circuits/InRange.circom`. Our circuit will take two public inputs:

  1. latitudeRange – an array of two numbers representing the min and max latitudes of the legal fishing range (must be positive numbers). 
  2. longitudeRange  – an array of two numbers representing the min and max longitudes of the legal fishing range (must be positive numbers). 

These two inputs will be visible to the public. This circuit will also take a private input:

  • fishingLocation – an array of two numbers representing the latitude and longitude of the actual fishing location. This input stays private.

The circuit will output a 0 (if the circuit is not satisfied) or a 1 (if it is). To compile the circuit, you’ll need to have circom installed:

npm install -g circom

Then you’ll need to compile the circuit:

circom InRange.circom -o InRange.json

This will take InRange.circom as input and output InRange.json in the same directory.

Setup


Now, we need to set up the circuit. In order to do this, we’ll need to have snarkjs installed.

npm install -g snarkjs

And then we can issue:

snarkjs setup -c InRange.json


This will create two files, proving_key.json and verification_key.json. As the names suggest, the proving key is the key you’ll need to prove that your input (your location) is valid. The verification key is the key you’ll need to verify anyone else’s proofs. When we set up our Fluree ledger, we’ll be putting both the proving and verification key (as well as the circuit) on the ledger.

Note – This type of zero-knowledge proof requires a trusted setup. The process of generating these keys will also create some toxic data that must be deleted. Participants need to trust that the toxic data was deleted. It is important to note that this toxic data would allow an untrustworthy participant to create a fake proof using inputs that don’t match the constraints. The toxic data would NOT allow a user to discover someone else’s secret location. 

But this toxic data is only created once, and there are methods to minimize the risk. For example, a multi-party trusted setup creates a situation where a number of participants come together to generate the proving and verification keys, and each of them possess a piece of toxic data. In a setup like this, the only way to create a fake proof would be if every single party was untrustworthy, and they all kept their toxic data and then colluded by bringing their toxic data together (hopefully an unlikely occurrence!).

Calculating a Witness


Before we can create a proof, we need to calculate all of the signals in the circuit (including all the intermediate signals) that satisfy the circuit’s constraints. In order to do this, we’ll need to create an input.json file containing all of our inputs (including the private input). Neither the input.json nor the witness.json file will be shared with anyone, but we do need to calculate them first.

Our input file needs to be a map, where the keys are the names of all of the circuit’s inputs, and the values are our specific inputs. For example:

{
    "latitudeRange": [ 20, 21],
    "longitudeRange": [ 176, 190],
    "fishingLocation": [ 20, 180]
}
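
For intuition, here is a plain-Python version of the constraint the circuit encodes, using the sample values above. There is nothing zero-knowledge about it; it simply shows when the circuit’s output should be 1.

def in_range(latitude_range, longitude_range, fishing_location):
    """Return 1 when the (private) location lies inside the public bounding box, else 0."""
    lat, lon = fishing_location
    lat_ok = latitude_range[0] <= lat <= latitude_range[1]
    lon_ok = longitude_range[0] <= lon <= longitude_range[1]
    return int(lat_ok and lon_ok)

# Using the inputs from input.json above:
print(in_range([20, 21], [176, 190], [20, 180]))  # 1 -> a valid witness exists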


We can then calculate the witness, which will generate the witness.json file. 

snarkjs calculatewitness -c InRange.json


Create the Proof


Now, we have all of the pieces to create the proof:

snarkjs proof


This command uses the proving key and the signals in witness.json to generate a proof.json (the actual proof) and public.json, which is a subset of your witness.json containing only your public inputs and the outputs.


Verify the Proof


You can now give any other party your verification_key.json, proof.json, and public.json, and they can verify that you put in an input that matched the constraints (a location within the legal range).

Connecting this with Fluree


In the fluree/legal-fishing repo, we not only have an example circuit with example keys and inputs, but we also have a small demo React app that makes it easy to connect this zero-knowledge proof to a Fluree instance.

To get this running, you’ll need to:

  1. Download a fresh instance of Fluree. I used version 0.11.5.
  2. Run Fluree (./fluree_start.sh). Fluree needs to be running on port 8080. This is the default port, so if you didn’t change any of the settings for version 0.11.5, everything will be all set for you. 
  3. Create a database called legal/fishing.
  4. Issue a transaction to create the schema. You can find this transaction in seed/schema.json.
  5. Issue a transaction to upload the circuit, verification key, and proving key to Fluree. You can generate and upload your own, or you can use the example in seed/seed.json. If uploading your own, be very careful to copy the keys and circuit exactly; otherwise, the proof and verification won’t work.

Now, you can run `npm install` and `npm start` to start up the lightweight React app, which integrates Fluree with the zero-knowledge proofs.



You can use the app to generate a proof and submit the proof and public signals to the Fluree ledger.

You can also click on the Verify Proofs page to see all the proofs that have been submitted to this ledger. Click “Verify Proof” to verify any given proof. Note that verifying a proof takes a little while, so expect to wait 10–20 seconds before a green “Verified!” alert comes up. For a full tour of the application, as well as a visual walk-through of setting up the circuit and app, check out this project’s accompanying video.

Additional Considerations


This small project is only a tiny part of the puzzle needed to ensure seafood traceability. For starters, this example only deals with a single rectangular area. A real-life project would, of course, be much more complicated than this. Creating and verifying zero-knowledge proofs can be time-intensive, so a real-world implementation would require careful consideration of timing. There are zero-knowledge proofs specifically optimized for range proofs, which might be a better fit for this example. This could be an area of future exploration for us.

Additionally, even if the proof itself took the full scope of real-world restrictions into account, a fisher’s location at the time of catch would have to be reported by a source that is reliable. For example, we might want a piece of hardware that is sufficiently tamper-proof reporting a fisher’s location, rather than, say, the fisher’s word. We would also need a reliable way to correlate a GPS location to a particular catch. For hardware, considerations of cost, hassle to the fishers, and tamper-proofness would all have to be weighed. 

A final area to consider is public knowledge and trust of zero-knowledge proofs. Even if mathematically, we can show that a zero-knowledge proof does not reveal a fisher’s location, the fisher would have to trust the organization implementing this system. The fisher would first have to trust that their location is not hidden somewhere in the proof they are uploading to the database. The proof is a large, JSON object that could conceivably hide information. The fisher would also have to trust that the hardware they are using to report their location is not sending it out through some backdoor. 

These are assuredly not insurmountable concerns, but they should be considered as food-for-thought. Research, implementation, and public understanding of zero-knowledge proofs have really grown in the past few years due to projects like ZCash, so this is definitely an area to look out for!

Thanks everyone for reading, and I’m interested in any and all feedback. If this piqued your interest, you might be interested in checking out other projects that tackle the challenge of proof of location, as well as this curated list of zero-knowledge proof content. You can also get started with Fluree here, and make sure to check out our documentation as well! 

* For those unfamiliar with the “Where’s Waldo” books (or “Where’s Wally” outside of North America), Waldo is a cartoon man in a red-and-white striped shirt. “Where’s Waldo” books have page after page of hectic scenes, filled with people and colors. The object of the game is to try and spot Waldo. 



There seems to be a lack of — ahem — *consensus* around blockchain’s definition.

While the philosophical spectrum ranges from satoshi minimalists to mass enterprise adopters, it is important to first understand the technical componentry of blockchain in order to deploy it as a useful application. Here is a “feature-first” definition that may be useful in understanding these technical underpinnings:

[Graphic: blockchain defined as a distributed digital ledger that uses peer-to-peer consensus within a decentralized network to validate transactions and a hashing algorithm to cryptographically link them in a chronological “chain” of records, broken into nine parts that elaborate on each element of the definition.]



Today, we’ll dive into immutability, a core defining feature of blockchain.

Across the hundreds of articles and conversations around Blockchain, you’ll find the term “immutable” almost always present. Immutability — the ability for a blockchain ledger to remain a permanent, indelible, and unalterable history of transactions — is a definitive feature that blockchain evangelists highlight as a key benefit. Immutability has the potential to transform the auditing process into a quick, efficient, and cost-effective procedure, and bring more trust and integrity to the data businesses use and share every day.

We spend billions of dollars on cybersecurity solutions meant to keep outside prying eyes from accessing our sensitive data. But rarely do we fight the internal cybersecurity battle: ensuring that our data has not been manipulated, replaced, or falsified by a company or its employees. In many cases, we have come to simply trust that the data is correct thanks to measures like private keys and user permissions. But in reality, we cannot prove — methodically or mathematically — that information in a standard application database is unequivocally tamper-free. Auditing becomes our next (and expensive) line of defense.

Blockchain implementation can bring an unprecedented level of trust to the data enterprises use on a daily basis — immutability provides integrity (both in its technical and primary definition). With blockchain, we can prove to our stakeholders that the information we present and use has not been tampered with, while simultaneously transforming the audit process into an efficient, sensible, and cost-effective procedure.

How Immutability is Achieved

A Brief Introduction to Cryptography and Hashing

Before we dive into blockchain immutability, we’ll need to understand cryptographic hashing. Here are the basics:

[Image: SHA-256 Hashing Machine.]


Want to test out some basic hashing? Here is a free Sha-256 hash calculator: http://www.xorbin.com/tools/sha256-hash-calculator
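
If you’d rather test it locally, here is a quick sketch using Python’s standard hashlib module: the same input always produces the same 256-bit digest, while changing a single character produces a completely different one.

import hashlib

# Deterministic: the same input always hashes to the same digest.
print(hashlib.sha256(b"Fluree is a data management platform").hexdigest())

# A one-character change yields an entirely different digest.
print(hashlib.sha256(b"Fluree is a data management platform!").hexdigest())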

Cryptography + Blockchain Hashing Process = Immutability

Each transaction that is verified by the blockchain network is timestamped and embedded into a “block” of information, cryptographically secured by a hashing process that links to and incorporates the hash of the previous block, and joins the chain as the next chronological update.

The hashing process for a new block always incorporates the previous block’s hash output. This link makes the chain “unbreakable” — it’s effectively impossible to manipulate or delete data after it has been validated and placed on the blockchain, because any attempted modification would invalidate the hashes of every subsequent block.

In other words, if data is tampered with, the blockchain will break, and the reason could be readily identified. This characteristic is not found in traditional databases, where information can be modified or deleted with ease.

The blockchain is essentially a ledger of facts at a specific point in time. For Bitcoin, those facts involve information about Bitcoin transfers between addresses. The below image shows how the checksum of transaction data is added as part of the header, which, in turn, is hashed into and becomes that entire block’s checksum.

[Image: the blockchain hashing process for immutability.]
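
To make that linkage concrete, here is a toy hash chain in Python. It is a deliberate simplification (real blockchains add Merkle trees, signatures, timestamps, and consensus), but it shows why rewriting one block breaks every block after it.

import hashlib
import json

def block_hash(prev_hash, data):
    """Hash a block's data together with the previous block's hash."""
    payload = json.dumps({"prev": prev_hash, "data": data}, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

# Build a tiny three-block chain on top of a genesis hash.
chain = []
prev = "0" * 64
for data in ["tx-1", "tx-2", "tx-3"]:
    block = {"data": data, "prev": prev, "hash": block_hash(prev, data)}
    chain.append(block)
    prev = block["hash"]

# An attacker rewrites the middle block's data and even recomputes its hash...
chain[1]["data"] = "tx-2-forged"
chain[1]["hash"] = block_hash(chain[1]["prev"], chain[1]["data"])

# ...but re-validating the chain still exposes the edit: the forged block looks
# internally consistent, yet the next block no longer links to its hash.
prev = "0" * 64
for block in chain:
    links_ok = block["prev"] == prev
    hash_ok = block["hash"] == block_hash(prev, block["data"])
    print(block["data"], "valid" if (links_ok and hash_ok) else "INVALID")
    prev = block["hash"]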

Benefits, Explained


Why does immutability matter? For the enterprise, the immutability that comes with blockchain implementation offers serious overhead savings, simplified auditing, and fraud prevention. We’ll break these concepts down:

Complete Data Integrity — Ledgers that deploy blockchain technology can guarantee the full history and data trail of an application: once a transaction joins the blockchain, it stays there as a representation of the ledger up to that point in time. The integrity of the chain can be validated at any time by simply re-calculating the block hashes — if a discrepancy exists between block data and its corresponding hash, the transactions are not valid. This allows organizations and their industry regulators to quickly detect data tampering.

Simplified Auditing — Being able to produce the complete, indisputable history of a transactional ledger allows for an easy and efficient auditing process. Proving that data has not been tampered with is a major benefit for companies that need to comply with industry regulations. Some common use cases include supply chain management, finance (for example, Sarbanes-Oxley disclosures), and identity management.

Increase in Efficiencies — Maintaining a full historical record is not only a boon to auditing; it also opens new opportunities in query, analytics, and overall business processes. FlureeDB, for instance, takes advantage of the concept of time travel for business applications: queries can be issued as of any block — or point in time — and immediately reproduce that version of the database.

This capability unlocks a host of time and cost savings — including tracking the provenance of major bugs, auditing specific application data, and backing up and restoring database states to retrieve information. Immutability can make many of the modern-day data problems that plague enterprise applications irrelevant.

Proof of Fault — Disputes over fault in business are all too common. The construction industry accounts for $1 trillion in losses as a result of unresolved disputes. While blockchain won’t wholly dissolve this massive category of legal proceedings, it could be leveraged to prevent a majority of disputes related to data provenance and integrity (essentially proving who did what, and when).

Blockchain finality allows us — and a jury — to fully trust every piece of information.

Fluree secures every transaction — proving who initiated it, when it was completed, and that it is free of tampering.

[Image: the Fluree transaction hashing process.]

It even tracks the changes a SaaS vendor makes to your transaction before it reaches the data storage tier, meaning you can trust your SaaS data without fully trusting your SaaS vendor:

[Graphic: Fluree data integrity, broken into five parts showing how Fluree secures every data transaction.]

The Asterisk to Immutability

Immutability ≠ Perfect, Truthful Data

Blockchain is a mechanism for detecting mathematical untruths, not a magical lie detector.

Blockchain doesn’t inherently, automatically, or magically make data truthful — its implementation merely secures data cryptographically so that it can never be altered or deleted without consequence. Measures such as sharing your hash outputs directly with stakeholders (customers, auditors, etc.) or setting up a decentralized network of validation nodes are a good complement to the historical immutability the blockchain hashing process provides, adding an often-needed validation component.

In addition: the stronger the enforcement rules, the more reliable the data on the blockchain (Exhibit A: Bitcoin’s proof of work).

FAQs on Immutability


What are the disadvantages of immutability? How can they be avoided?

Having an unalterable history of transactions seems like the answer to many modern business problems — and it is, in many ways. But what happens when sensitive data — like employees’ home addresses — is accidentally published to a blockchain? Ideally this wouldn’t be an issue, since standard design practice for blockchain environments is to keep sensitive and personally identifying information separate from on-chain records.

If you’re running a private, federated blockchain, your company would have to convince the other parties to agree to a “fork” in the blockchain — where the blockchain splits into two paths and the newly designated database continues on. All or most parties involved will have to agree on the terms, including which block to fork at and any additional rules of the new database. If the blockchain is truly public, it is next to impossible to have this information removed (a hard fork is also required here, and you are much less likely to convince the other parties in the network to comply).

In terms of Databases and Infrastructure, isn’t holding the entire transaction history very costly with regard to space?

Holding every database state might have been costly in the ’90s, but current storage is incredibly cheap (1 GB of AWS S3 storage costs roughly $0.023 per month). And the benefits (data integrity, the ability to issue queries against any point in time) far outweigh the slight cost difference.

Fluree offers an ACID-compliant blockchain distributed ledger that records every state change in history as an immutable changelog entry. It allows for powerful query capability with FlureeDB, a graph query engine. By bringing blockchain to the data tier, Fluree is a practical and powerful infrastructure on which to build, distribute, and scale custom blockchains.