Fluree Blog | Blog Post | Brian Platz | 07.06.23

How Our Obsession with Data Became a Hoarding Problem

While data holds immense potential for innovation and progress, our insatiable appetite for collecting and storing it has led to new sets of challenges.

It started innocently enough. Sometime in the mid-2000s, everyone and everything began to generate data. Seeking to derive value from the data, companies hired data analysts. The analysts grew tired of pulling data from multiple sources—such as various SaaS apps—before they could analyze it. IT suggested copying all relevant data into a single data warehouse, where it would be easier to pull and analyze.  

The analysts were happy. IT looked smart. CEOs received data-driven insights. The data warehouses filled up with copies of data. 

Unlike gold bars or vintage cars, however, most data does not appreciate in value. Aside from a few exceptions, such as annual financial statements, the longer you store data, the more it perishes. Outside of the data warehouse, people are constantly interacting with software, and the data updates accordingly. Inside of the warehouse, the old copies grow outdated and useless. 

Around the world, acres of data warehouses brim with useless data. Companies pay thousands to millions of dollars in storage and inventory fees under the assumption that the data they save will come in handy someday. Most of it won’t. Executives are simply soothing their nerves about losing potential insights. 

Surrounded by data FOMO, it’s easy to miss the fact that you don’t need a data warehouse at all. Data can exist in a network rather than a giant storage building. In fact, data is more useful, fresh, secure, and trustworthy that way. Like the people on the TV show Hoarders, we need to let go of old notions of meaning and tidy up how we perceive and use data.

The Data Warehouse Becomes Musty

There are three reasons to be wary of the data warehouse. 

1. Most of your data is perishable.

2. More data does not lead to better insights.

3. Data security disappears. 

To paraphrase Marie Kondo, there is cost-saving magic in tidying your data. A midsized company with a warehouse full of old data might be spending 5% of its profits on data storage while only ever using 2% of that data. If a team takes the time to figure out which data it is reasonably sure it will use and trashes the rest, data warehousing might cost only 1% of profits.
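As a rough illustration of that arithmetic, here is a short Python sketch. The $10 million annual profit is a hypothetical assumption, not a figure from the scenario above; only the 5% and 1% rates come from it.

```python
# Illustrative arithmetic only; the profit figure is a hypothetical assumption.
annual_profit = 10_000_000  # dollars

cost_before = annual_profit * 5 // 100  # 5% of profits spent on storage
cost_after = annual_profit * 1 // 100   # 1% of profits after trimming unused data
savings = cost_before - cost_after

print(f"Before: ${cost_before:,}  After: ${cost_after:,}  Saved: ${savings:,}")
```

Under that assumption, the storage bill drops from $500,000 to $100,000 a year.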

But, you might say, big tech companies like Google and Microsoft have entire warehouses full of digital exhaust and keep finding new uses for it, most recently in their AI models. Isn’t there a chance of figuring out how to use that old data eventually? Yes, there’s a chance, especially if you’re a tech giant that has been collecting data since the early 2000s and working on AI for about as long. Unless you’re playing in that league, with equivalent resources, your data will probably just continue to perish. It is better to work with what you can use. If AI is a concern, see whose model and data set you can access instead of trying to become a down-market version of Big Tech, or collect the kind of niche data that Big Tech won’t focus on, which also requires thought and deliberation.

Another potential concern with the data warehouse is that data loses permissions and security. 

You can create all the permissions you want for a SaaS app. Once you rip data out of the back end and dump it into a warehouse, however, all those permissions are stripped away, because analysts need unrestricted access to work with the data. If someone steals credentials, sensitive data is exposed. Re-implementing the SaaS permission model in a separate system is an option, but it costs time and money and complicates workflows.

There is an alternative to the data warehouse. It’s called the data network. To understand how it works, it’s worth looking at the manufacturing sector as an analogy. 

Just-In-Time Inventory

In the 1980s, traditional manufacturing was upended by just-in-time practices. Manufacturers built small factories that responded to product demand. Instead of storing mass-produced inventory in warehouses and waiting for demand to strike, manufacturers could produce responsively, and then send products to be fulfilled by a third party.

Similarly, for many use cases, storing big data in a big warehouse makes no sense. If you set up and manage your data strategically, you may not need to move it into a giant warehouse at all. 

Instead, you can use decentralized data, which I covered in my latest post at Forbes Tech Column. In short, decentralized data is akin to hyperlinking data the way we currently link websites on the internet. Networks of data are created through these links, which are stored in a semantic knowledge graph database. Whenever you query the database, results come from the data network. The data itself is constantly updated as people interact with the software generating it. Each piece of data is fresh, and there is no need for a warehouse. 
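To make the contrast concrete, here is a minimal Python sketch. It is not a real knowledge graph database: `Source`, `warehouse_copy`, and `LinkedGraph` are hypothetical names invented for illustration. The idea is only that a warehouse holds a snapshot, while a link is followed at query time and so returns whatever the source holds now.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical system of record, e.g. a SaaS app's back end."""
    name: str
    records: dict = field(default_factory=dict)


def warehouse_copy(source: Source) -> dict:
    """Warehouse approach: snapshot the records at copy time."""
    return dict(source.records)


class LinkedGraph:
    """Toy stand-in for a knowledge graph: it stores links, not data."""

    def __init__(self):
        self._links = {}  # identifier -> (source, key)

    def link(self, identifier: str, source: Source, key: str) -> None:
        self._links[identifier] = (source, key)

    def query(self, identifier: str):
        # Follow the link at query time, so the answer reflects
        # the source's current state rather than an old copy.
        source, key = self._links[identifier]
        return source.records[key]


crm = Source("crm", {"customer-42-email": "old@example.com"})
snapshot = warehouse_copy(crm)

graph = LinkedGraph()
graph.link("customer/42/email", crm, "customer-42-email")

# The email is updated in the CRM after the warehouse copy was taken.
crm.records["customer-42-email"] = "new@example.com"

stale = snapshot["customer-42-email"]
fresh = graph.query("customer/42/email")
print(stale, fresh)
```

The snapshot still holds the old address, while the linked query returns the new one without anyone having to re-copy anything.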

Decentralized data is also secure. It doesn’t all sit in one warehouse, stripped of permissions. Rather, it exists at its point of origin, linked through the knowledge graph database. Nobody needs to make copies. Whoever manages the data can also wrap it in permissions. Anyone who queries the data has to meet those permissions, which reduces security risks. Because data lives at its point of origin, each piece of sensitive data can also come with a history of its own creation and use, so that whoever queries it knows they can trust it.
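A small Python sketch of that idea follows. `GuardedRecord`, its role names, and the sample values are hypothetical and chosen only for illustration; real systems enforce policy in far richer ways. The record stays at its origin, checks each reader against its own permissions, and logs every access attempt as part of its history.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedRecord:
    """Hypothetical record that carries its own policy and history."""
    value: str
    allowed_roles: frozenset
    history: list = field(default_factory=list)

    def read(self, user: str, role: str) -> str:
        allowed = role in self.allowed_roles
        # Log every attempt, so later readers can see how the data was used.
        self.history.append((user, role, "granted" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{user} ({role}) may not read this record")
        return self.value


salary = GuardedRecord(value="85000", allowed_roles=frozenset({"hr"}))
salary.history.append(("payroll-app", "system", "created"))

hr_view = salary.read("dana", "hr")

try:
    salary.read("sam", "analyst")
    analyst_blocked = False
except PermissionError:
    analyst_blocked = True

print(hr_view, analyst_blocked, salary.history)
```

Because the check lives with the data rather than with a copy, there is no second permission model to rebuild and keep in sync.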

I have a dream about the entire internet operating in this way, with linked, secure and trustworthy data. That’s called Web3, and it is still very much a work in progress. Any organization, however, can begin building data networks right now and growing them over time. Even companies with massive stockpiles of old data in a warehouse can start by auditing and moving their data into networks. And for anyone considering buying space in a data warehouse, I have this advice: Don’t do it. Take your data seriously. Start with where the internet is today, not where it was 20 years ago. Particularly as AI integrates into almost every workflow, and the stakes for trustworthy, secure data become higher than ever, decentralized data is a bet that will pay off in the long run.