Seamless Data Sharing, Baked Right In

As things start to wind down at the end of the year, those of us in tech start looking for easy wins, like updating our email signatures to match the new company branding guidelines, or scheduling a “first thing next year” check-in for that data integration effort that has as much chance of reporting progress as Santa’s milk and cookies have of seeing Christmas Day.

We who manage data and share it with others knowingly smile because we’ve all seen a “data integration” project or even a “simple data request” that seems straightforward but is actually fraught with unforeseen challenges and setbacks. Why is it that these kinds of efforts are so widely known to cause grief? It’s because in a career that’s so often filled with miraculous technological advances and software tools that exponentially narrow development time and widen the impact of our efforts, we feel ridiculous estimating (and inevitably extending) a timeline of half a year or more for any project that involves sharing existing data across applications and firewalls.

Let’s get into the fundamentals of sharing data, expose the challenges lurking in the implementation details, and learn why projects that use Fluree need not be concerned about any of this thanks to a few open web standards – baked right into Fluree – that enable seamless data sharing.

The chestnut of this article came from our walkthrough documentation on the fictional Academic Credential Dataset. If you’d like to dive deeper into solving the issues tackled in this post, head on over to our docs site to see what it’s like solving hard problems with Fluree.

The Data Sharing Problem

In our fantastic Collaborative Data doc, we highlight the core problem with sharing data: naming! You and I describe the same things in different ways. My Customers table and your CustomerDetails table might both describe Freddy the Yeti, and may even contain exactly the same data for Freddy, but in typical database systems there is no way for us to know we’re describing the same yeti, nor can we even compare the data we have on Freddy from our separate systems without a mapping definition and process.

In fact, when there are any discrepancies between the source and destination schemas, the task of sharing data shifts from a self-service and, dare I say, automated task to a more complex endeavor that requires business analysts, developers, database administrators, and, of course, project managers to corral the whole circus of defining, communicating, building, and validating the necessary data pipeline. I may be exaggerating a bit, especially for smaller requests where a simple export of a small set of records is concerned, but even for a small task, both the provider and the consumer need some understanding of the target data and its context to make sense of the request and the resulting data. Note that when building a data pipeline, much of the burden falls on the shoulders of tech knowledge workers. We’re expected to learn and reason about multiple contexts, construct system integrations that communicate over time and space, and handle edge cases and dirty data. Eventually we’re also asked to shoulder the weight of maintenance, changing requirements, and feature creep. This is where the cost and grief come from.

If maintaining a consistent mapping is crucial for the data owner, it can be achieved by layering it in as the source data is added. However, this approach often results in data duplication, as the information must be stored in multiple formats. Alternatively, the mapping can be automated and done on-the-fly, as the data is requested, but this takes development resources and, depending on the amount of data and frequency of requests, can get expensive (and deeply annoying). Neither of these methods takes into account scenarios where mappings evolve over time or when there are numerous requestors, each with their unique data format requirements.

Two different views of Freddy the yeti, each with their own context.
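To make the mapping burden concrete, here is a minimal sketch in Python of the on-the-fly translation described above. Everything in it is hypothetical: the two records, their field names, and the `translate` helper are invented for illustration and are not part of Fluree or any other product.

```python
# Hypothetical records: the same yeti as stored by two independent systems.
my_customer = {"id": 42, "name": "Freddy", "species": "yeti"}
your_customer_details = {"customer_no": "C-0042", "full_name": "Freddy", "kind": "yeti"}

# A hand-maintained mapping from "your" field names to "mine".
# Someone has to write this, agree on it, and keep it current.
YOUR_TO_MY_FIELDS = {"full_name": "name", "kind": "species"}


def translate(record, field_map):
    """Rename the fields listed in field_map and drop everything else."""
    return {field_map[key]: value for key, value in record.items() if key in field_map}


translated = translate(your_customer_details, YOUR_TO_MY_FIELDS)

# After translation the overlapping fields can finally be compared...
shared_fields = {key: my_customer[key] for key in translated}
print(translated == shared_fields)

# ...but the identifiers still disagree, so nothing here proves
# that both records describe the same yeti.
print(my_customer["id"] == your_customer_details["customer_no"])
```

Even in this toy case, the mapping lives outside both datasets, and every new consumer with a different format needs another one.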

Okay, so differing data schemas mean trouble for data sharing, making it complex, expensive, and generally slow. So why do we have these differences in the first place? If it’s such an effort to map and transform data, why can’t the receiver just use the same schema as the sender? Or vice versa?

There are many reasons that vary with size and scale, but most of them boil down to communication, coordination, and cost.

Different Goals, Different Contexts

When building data exports, APIs, and other data sharing infrastructure, data owners lean on their own internal understandings of their data. There are intrinsic properties of the data (e.g. relationships, data types, field names) that only exist as a byproduct of the context in which the data owner collected and generated the data, and yet these properties dictate the format, shape, and content of the data being shared. On the other hand, each and every data consumer (anyone who receives data from a data owner) has their own distinct understanding of the data they’re interested in. They operate within a different context. They have applications and analytics that are built on data of a certain shape with certain properties that have certain names and types.

Efficiently conveying these distinct contexts and ensuring that everyone consistently employs the same data context for their specific use cases can appear to be an insurmountable challenge. On the other hand, if we permit each consumer to maintain their own context, any modification to the data sharing infrastructure necessitates an equivalent degree of communication and coordination, resulting in each individual bearing the cost of staying up-to-date.

The challenges with data formatting and mapping make sharing data and hosting data sources difficult to accomplish and, when successful, constrained to niche, data-intensive research fields that require a consistent context. To mitigate these challenges, such fields must rely on centralized data brokers that dictate the sharing format and other rules of engagement. This setup means relinquishing data ownership and control, reducing the benefit of data partnerships, and limiting the reach of knowledge and information.

The tl;dr is: the current state of data infrastructure can only produce data silos which constrain the impact of our data.

In an ideal world, we would all use the same schema, the same data context. We would use the same table names, use the same column names and data types, and, while we’re dreaming, we’d use the same identifiers for our records! That way, there’s no need for a translation layer, there’s just retrieving data. Seems silly right? There’s no way we can all use the same database schema. What would that even mean?

Fluree builds on web standards to provide essential features that, when combined, solve these traditionally challenging problems. One of these standards is the JSON-LD format, which gives our data the ability to describe itself, enabling portability beyond the infrastructure where it originated. We call this “schema on write” and “schema on read,” which just means developers can build their databases on top of universal standards, and that data can be immediately shared, pre-packaged with mappings and context for consumer use cases. Let’s take a closer look at how Fluree’s approach to data management obviates these problems.

The Data Sharing Solution

What does it mean for our data to be able to “describe itself” and how does this concept solve these longstanding data-sharing problems? I mentioned the term “context” a bit in my statement of the problem. In a nutshell, the data context contains all of the semantic meaning of the data. This includes field names, relationships among objects and concepts, and type information (is this data a number, a date, or a more complex class like an invoice or a patient visit?). This contextual data is traditionally defined all over the place: in table and column definitions, in application and API layers, in data dictionaries distributed with static datasets, in notes columns in a spreadsheet, in the head of the overworked and under-resourced data architect. This contextual data, as discussed, can be difficult to maintain, represent, and transfer to data consumers.

But wait! What if all this contextual data was stored and retrieved as data in the dataset? What would it look like if we took all of the contextual data that can be found in the best, most complete data dictionary, API documentation, or SQL schema explorer and just injected it right in there with the data content itself? JSON-LD and a few other open web standards, like RDF and RDFS, do exactly this, and Fluree relies on them to enable simple and seamless data sharing.
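As a rough illustration of context living alongside content, here is a small JSON-LD document built as a Python dictionary. The `example.org` identifiers and the `ex:Yeti` class are made up for this sketch; the `schema.org` and XML Schema IRIs are real public vocabularies, though whether a given dataset uses them is an assumption here.

```python
import json

# A JSON-LD document: the "@context" block plays the role of the data
# dictionary, and it travels with the record itself.
freddy = {
    "@context": {
        "schema": "http://schema.org/",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "ex": "http://example.org/vocab/",  # hypothetical vocabulary
        # Field name -> globally defined property
        "name": "schema:name",
        # Field name -> property plus datatype
        "birthDate": {"@id": "schema:birthDate", "@type": "xsd:date"},
        # Field name -> relationship whose value is another identified thing
        "knows": {"@id": "schema:knows", "@type": "@id"},
    },
    "@id": "http://example.org/yeti/freddy",  # hypothetical identifier
    "@type": "ex:Yeti",
    "name": "Freddy",
    "birthDate": "1987-12-25",
    "knows": "http://example.org/yeti/betty",
}

print(json.dumps(freddy, indent=2))
```

A consumer who receives this document gets the field names, the datatype of `birthDate`, the class of the record, and the meaning of `knows` without needing a separate schema file.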

RDF and JSON-LD are simply data formats that can represent graph data. We go into more detail in our Data Model doc, and there are some excellent resources online, as RDF has been around for a bit. RDFS is an extension of RDF that adds some very useful data modeling concepts like classes and subclasses, which enable us to describe the hierarchies in our data. JSON-LD’s ability to convey contextual and vocabulary data alongside the data itself is covered extensively in our excellent Collaborative Data doc.

The gist is that by using universally defined identifiers for both subjects and properties, all participants (both data sources and consumers) can build on top of a fully-defined and open understanding of the data schema. No more data silos.
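Here is a minimal sketch of why shared identifiers remove the need for a hand-built mapping. The `expand` function below is a deliberately simplified stand-in written for this post: it only handles flat term-to-IRI contexts and is not a conformant implementation of the JSON-LD expansion algorithm. The `example.org` IRIs are hypothetical.

```python
def expand(doc):
    """Replace each local key with the IRI its @context assigns to it.

    Simplified for illustration: only flat term -> IRI string mappings
    are supported, and keys without a mapping are kept as they are.
    """
    context = doc.get("@context", {})
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        expanded[context.get(key, key)] = value
    return expanded


# My view of Freddy, using my own field names.
mine = {
    "@context": {
        "name": "http://schema.org/name",
        "species": "http://example.org/vocab/species",
    },
    "@id": "http://example.org/yeti/freddy",
    "name": "Freddy",
    "species": "yeti",
}

# Your view of Freddy, using different field names for the same properties.
yours = {
    "@context": {
        "full_name": "http://schema.org/name",
        "kind": "http://example.org/vocab/species",
    },
    "@id": "http://example.org/yeti/freddy",
    "full_name": "Freddy",
    "kind": "yeti",
}

# Once local names are resolved to shared IRIs, the records line up
# without any mapping table written by either party.
print(expand(mine) == expand(yours))
```

Each side keeps its preferred local names, and because both the subject and the properties resolve to the same identifiers, the two records can be recognized as statements about the same yeti.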

Oh hey! Thanks for reading! I’ll leave you with another benefit of using Fluree: portability! Portability, the opposite of “vendor lock in”, is another one of Fluree’s incredible side effects. Because Fluree is built on open standards, like the ones discussed in this article, all of the value provided is baked right into the data itself! This means that Fluree is relying on externally-defined mechanisms (including the storage format, RDF) that have meaning outside of any Fluree database or platform. So when sharing your data or if you decide to use a different database in the future, all of the self-describability goes along for the ride!