What is Dirty Data?
Simply put, dirty data is information that is incorrect, outdated, inconsistent, or incomplete. Dirty data may take a simple form, such as a spelling error or an invalid date, or a complex one, such as an inconsistency in which some systems still report facts that have since been updated elsewhere. In all cases, dirty data can have massive negative consequences for daily operations and data analytics. Without good data, how are we supposed to make good decisions?
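To make the simple cases concrete, here is a minimal sketch of a check that flags incomplete and incorrect values. The records, field names, and the `find_issues` helper are hypothetical and exist only for illustration.

```python
from datetime import datetime

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "name": "Acme Corp", "signup_date": "2021-03-15"},
    {"id": 2, "name": "", "signup_date": "2021-02-30"},  # missing name, impossible date
    {"id": 3, "name": "Globex", "signup_date": "2020-11-01"},
]


def find_issues(record):
    """Return a list of data quality issues found in one record."""
    issues = []
    if not record.get("name", "").strip():
        issues.append("incomplete: name is missing")
    try:
        # strptime raises ValueError for malformed or impossible dates.
        datetime.strptime(record.get("signup_date", ""), "%Y-%m-%d")
    except ValueError:
        issues.append("incorrect: signup_date is not a valid date")
    return issues


# Map each dirty record's id to the issues found in it.
dirty = {r["id"]: find_issues(r) for r in records if find_issues(r)}
print(dirty)
```

Checks like this only catch the simple forms. The complex form, where two systems disagree about the same fact, cannot be detected by looking at one record in isolation.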
How Common is Dirty Data?
The short answer is: very common. In fact, it’s systemic. Experian recently reported that on average, U.S. organizations believe 32 percent of their data is inaccurate. The associated cost is equally staggering: in the U.S. alone, bad data costs businesses $3 trillion per year.
And yet, most organizations believe that data is a strategic asset, and that extracting the value out of data is one of the most strategic business imperatives today. So, how did we get to a point where something so valuable is still so hard to clean and maintain?
Dirty Data’s Origin Story
The reason that Dirty Data is systemic is that its creation is baked into today’s business-as-usual processes for how data is being used. And this is because most companies’ relationship with data has evolved, but the data management processes and technologies have not kept up with the change.
Data was originally the by-product of a business function, such as Sales or Order Fulfillment. Each business function focused on optimizing its business processes through digitization, process automation and software. Business applications produced the data that was needed to operate a process, in a data model that was best understood by the application. In the beginning, there may have been some manual data entry errors here or there, but by and large most business functions maintained data integrity and quality at the levels required for them to successfully execute their operations.
However, with the introduction of the Internet and Big Data over 20 years ago, we learned that data has extraordinary intrinsic value beyond operating individual business functions. When data from across functions, product lines and geographies was correlated, combined, mixed and merged along with other external data, it became possible to generate new insights that could lead to innovations in customer experience, product design, pricing, sales and service delivery. Data analytics emerged as the hot buzzword and companies began investing in their analytic infrastructure. But there were a few inherent problems with how data analytics was being done:
- Business applications that create and store data were designed to use data for fast, operational transaction processing, not for doing heavy analytic querying. So, in order to start analyzing data, we started copying over data from operational sources (e.g., the business application’s database) into sources that were better suited and optimized for analytics (e.g., the data warehouse or the data lake).
- Whenever one business function wants to use another function’s data for a new use case, we need to ETL the data in order to merge it. That is, we Extract the data from the source, Transform its initial data model or structure (which was originally created by a business app, exclusively for the business app) into something that can be used by the use case, and then Load the transformed data into the new target system. In addition to copying the data from its source, we have also now changed the data.
- Each time there is a new idea or use case for data, we copy the data again and transform it.
- Sometimes we copy the copy, because it’s easier to access the copy than to get your hands on the original version, and then we transform it and load it again to a new system.
- We copy and copy and copy, and after years of doing this, we now have so many copies that most organizations can’t tell where the data they are using for a report came from, what the authoritative source for any piece of data is, or whether their data can be trusted or used.
- And what’s worse, proliferating data through copying not only increases the likelihood of data integrity and quality issues, but also increases the risk of cybertheft and data leakage.
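The copy-and-transform pattern described in the list above can be sketched in a few lines. This is an illustration under assumed names (`etl_to_warehouse`, `etl_to_mart`, and the record fields are hypothetical), not a description of any particular pipeline.

```python
import copy

# Hypothetical operational record owned by a Sales application.
source = {"customer": "Acme Corporation", "revenue_usd": 1234.56, "status": "active"}


def etl_to_warehouse(record):
    """Extract a copy, transform it for reporting, and return the row to load."""
    row = copy.deepcopy(record)                                  # Extract
    row["revenue_kusd"] = round(row.pop("revenue_usd") / 1000)   # Transform: lossy rounding
    return row                                                   # Load (handed back to the caller)


def etl_to_mart(warehouse_row):
    """A second pipeline that copies the copy and reshapes it again."""
    row = copy.deepcopy(warehouse_row)
    row["customer"] = row["customer"][:8].upper()                # Transform: truncated key
    return row


warehouse = etl_to_warehouse(source)
mart = etl_to_mart(warehouse)

# The source is corrected later, but neither copy sees the change.
source["status"] = "closed"

print(source)
print(warehouse)
print(mart)
```

After two hops, the mart row has a truncated customer name, a rounded revenue figure, and a stale status, and nothing in the row itself says which system it came from.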
Dirty Data – A Product of Its Environment
At its core – Dirty Data was the unintended consequence of two factors:
- To be of maximum value, data needs to be used beyond its original intended purpose, by actors in an enterprise outside of the originating business function.
- Each time we use data, we are forced to copy and transform it. The more we copy and transform, the more we create opportunities for data to lose integrity and quality.
Many organizations have begun adopting Data Governance processes to grapple with the viral growth and transmission of Dirty Data. These include processes for managing what should be the authoritative sources for each type of data; collecting and managing meta-data (information about the data that is produced by data lifecycle processes, such as what the data describes, who created it, and where it came from); and implementing data quality measurement and remediation processes.
However, in order to prioritize, most data governance processes depend on first identifying Critical Data Elements (or CDEs), which are the attributes of data that are most important to certain business functions, and then wrapping data policy and control around those. While it makes sense to focus first on the data attributes that are most useful, this still defines what is critical based on what we know today. As we’ve seen, what is not considered critical today could become the most valuable asset in the future. This means that we need to evolve beyond data governance through CDEs if we truly want to eliminate Dirty Data from existence.
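As a rough illustration of data quality measurement scoped to CDEs, the sketch below computes completeness and validity for a small set of attributes. The rows, the `rules` dictionary, and the `measure` function are assumptions made for this example; a real program would agree on the rules with the business.

```python
import re

# Hypothetical customer rows; None marks a missing value.
rows = [
    {"email": "ana@example.com", "country": "US"},
    {"email": "bob[at]example.com", "country": "US"},
    {"email": None, "country": "usa"},
    {"email": "cy@example.org", "country": None},
]

# Assumed validity rules, one per Critical Data Element.
rules = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "country": re.compile(r"^[A-Z]{2}$"),
}


def measure(rows, rules):
    """Return completeness and validity ratios for each attribute in rules."""
    report = {}
    for field, pattern in rules.items():
        present = [r[field] for r in rows if r.get(field) is not None]
        valid = [v for v in present if pattern.match(v)]
        report[field] = {
            "completeness": len(present) / len(rows),
            "validity": len(valid) / len(present) if present else 0.0,
        }
    return report


report = measure(rows, rules)
for field, scores in report.items():
    print(field, scores)
```

Note that any attribute missing from `rules` is never measured at all, which is the CDE limitation in miniature.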
How do we move forward?
First, we must enhance and accelerate the data governance activities that have been proven to clean existing dirty data. In other words, stop the bleeding. But then, we need to address Dirty Data at its root by rebuilding the underlying data infrastructure to do away with the need to copy and transform data in order for it to be used across functions. Let’s explore each step further:
Step 1: Clean up your existing dirty data
Move Data Literacy and Data Management to the Front (to the Business)
One of the big challenges in most data governance processes is that they are considered IT problems. IT Developers engineered transformation scripts and data pipelines to copy and move data all over the place, and IT Developers are being asked to create more scripts and build new pipelines to remediate data quality issues. However, most IT Developers don’t actually understand the full context as to how the data is being used, or what the terms really mean. Only the individuals in the Front of the house, who build products or interact with customers, know what the data means for sure.
Business domain subject experts can quickly tell just by looking at it whether certain data is clean or dirty, based on institutional knowledge built over years of effectively doing their jobs. The only challenge is that this knowledge is stuck in their heads. Governance processes that can effectively institutionalize business subject matter expert knowledge are the ones that will be most successful. This can be accomplished by making data literacy and data accountability part of the core business culture, implementing data management procedures with business domain experts in key roles, and supplementing this with techniques and technologies such as Knowledge Graphs and crowdsourcing, that can collect and store business knowledge in a highly collaborative and reusable fashion.
Standardize Data Using a Universal Vocabulary
Because we need to transform data for each use case based on that use case’s target consumption model, another way to stop the bleeding is to start building a reusable and comprehensive vocabulary or business data glossary that is designed to address most use case needs, both now and, as best we can anticipate, in the future. Creating one consistent canonical semantic model inside an organization reduces the need to continuously create new consumption models and, with them, new ETL scripts and pipelines.
Organizations can accomplish this by building a framework for semantic interoperability – using standards to define information and relationships under a universal vocabulary.
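A canonical vocabulary can start as something as small as a shared field mapping. The sketch below is a simplified illustration: the system names, field names, and the `to_canonical` helper are all hypothetical, and a real semantic model would also define types and relationships, not just names.

```python
# Hypothetical canonical vocabulary: each source system maps its own field
# names onto one shared set of terms.
CANONICAL_MAP = {
    "crm": {"cust_nm": "customerName", "tel": "phoneNumber"},
    "billing": {"CustomerFullName": "customerName", "Phone": "phoneNumber"},
}


def to_canonical(system, record):
    """Rename a source record's fields to the shared vocabulary.

    Fields with no mapping are kept under an 'unmapped' key so that
    nothing is silently dropped.
    """
    mapping = CANONICAL_MAP[system]
    out, unmapped = {}, {}
    for key, value in record.items():
        if key in mapping:
            out[mapping[key]] = value
        else:
            unmapped[key] = value
    if unmapped:
        out["unmapped"] = unmapped
    return out


crm_row = to_canonical("crm", {"cust_nm": "Acme Corp", "tel": "555-0100"})
billing_row = to_canonical(
    "billing",
    {"CustomerFullName": "Acme Corp", "Phone": "555-0100", "Terms": "NET30"},
)
print(crm_row)
print(billing_row)
```

Every consumer can now rely on `customerName` and `phoneNumber` regardless of which system produced the row, and the `unmapped` bucket shows where the vocabulary still needs to grow.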
Apply Machine Learning Where it Makes Sense
Let’s face it – most data cleaning projects can snowball into massive undertakings, especially with increased data complexity or obscurities that simply cannot be solved manually at scale.
Machine learning can outpace the standard ETL pipeline process by applying an algorithmic process to data cleansing and enrichment. A machine learning approach to data transformation embraces complexity and will get better and faster over time. With machine learning, we can break past the subjective limitations of CDEs and treat any and all data attributes as if they are important. It’s important to pair the knowledge crowdsourced from business Subject Matter Experts (described above) with machine learning to generate the most accurate outcomes the quickest.
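To show the idea of pairing expert labels with an algorithm, here is a deliberately tiny stand-in: a nearest-neighbor lookup over character bigrams, seeded with a handful of expert-supplied labels. It is not a trained model and not how any particular product works; the labels, the threshold value, and the function names are assumptions for illustration.

```python
def bigrams(text):
    """Return the set of character bigrams of a normalized string."""
    t = "".join(ch for ch in text.lower() if ch.isalnum())
    return {t[i:i + 2] for i in range(len(t) - 1)}


def similarity(a, b):
    """Jaccard similarity between the bigram sets of two strings."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y) if x | y else 0.0


# Hypothetical labels supplied by subject matter experts: raw value -> clean value.
sme_labels = {
    "Intl Business Machines": "IBM",
    "I.B.M. Corp": "IBM",
    "Microsoft Corp.": "Microsoft",
    "MSFT": "Microsoft",
}


def suggest(raw, labels, threshold=0.4):
    """Suggest the clean value of the most similar labeled example, or None."""
    best = max(labels, key=lambda known: similarity(raw, known))
    return labels[best] if similarity(raw, best) >= threshold else None


for raw in ["Microsoft Corporation", "IBM Corporation", "Zebra Technologies"]:
    print(raw, "->", suggest(raw, sme_labels))
```

Values that fall below the threshold come back as `None`, which is the natural point to route them back to a subject matter expert for a new label.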
Step 2: Get rid of dirty data for good
In order to get rid of dirty data for good, you must take a step back and apply transformation at the data infrastructure level.
Too many data governance and quality investments focus on cleaning already-dirty data, while the very sources and processes that create the environment for dirty data remain untouched. Much like the difference between preventative healthcare and treatment-centered healthcare, the opportunity costs and business risks associated with being reactive to dirty data rather than proactive in changing the environment for data are enormous.
To proactively address dirty data at its root, we need to go back to the core architecture for how data is being created and stored: the Application-centric architecture. In an Application-centric architecture, the data is stored as the third tier in the application stack, in the application’s native vocabulary. Data is a by-product of the application and is inherently siloed.
Emerging new data-centric architectures flip the equation by putting data in the middle, and bringing the business functions to the data rather than copying data and moving it to the function. This new design pattern acknowledges data’s valuable and versatile role in the larger enterprise and industry ecosystem and treats information as the core asset to enterprise architectures.
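A toy sketch can show the difference in shape. Below, two business functions read from one shared store instead of keeping their own copies. The `SharedStore` class and the view functions are hypothetical, and a real data-centric platform would add security, lineage, and a shared vocabulary on top.

```python
from types import MappingProxyType


class SharedStore:
    """A single in-memory store that every business function reads from."""

    def __init__(self):
        self._records = {}

    def upsert(self, key, **fields):
        """Create or update a record in place."""
        self._records.setdefault(key, {}).update(fields)

    def view(self, key, fields):
        """Return a read-only projection computed at read time, not a stored copy."""
        record = self._records[key]
        return MappingProxyType({f: record[f] for f in fields if f in record})


store = SharedStore()
store.upsert("cust-1", name="Acme Corp", status="active", credit_limit=5000)


def sales_view():
    """What the Sales function needs to see."""
    return store.view("cust-1", ["name", "status"])


def finance_view():
    """What the Finance function needs to see."""
    return store.view("cust-1", ["name", "credit_limit"])


store.upsert("cust-1", status="closed")  # one correction, made once at the source

print(dict(sales_view()))
print(dict(finance_view()))
```

Because each function asks the store at read time, a single correction is visible everywhere, and there is no downstream copy left to go stale.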
Take a step back from building more data cleaning silos and pipelines, and look to best practices in data-centric architectures that produce future-proofed “golden records” out of the gate that can be applied to any number of operational or analytical applications.
There are foundational concepts to data-centric architecture that must be embraced to truly uproot dirty data (and their silos) for good – Security, Trust, Interoperability, and Time – that allow data to become instantly understandable and virtualized across domains.
Ready to get started?
Fluree’s data-centric suite of tools provides organizations with a path forward for better, cleaner data management:
Fluree Sense is the first step in your data transformation journey. Fluree Sense uses machine learning trained by subject matter experts to ingest, classify, and remediate disparate legacy data. Fluree Sense is perfect for getting your existing legacy data into shape.
Fluree Core is your second step. Fluree Core provides a data management platform that enables true data-centric collaboration between any number of data sources and data consumers. Fluree’s core data platform is a source graph database that leverages distributed ledger technology for trust and data lineage, data-centric security for data governance that scales, and semantic interoperability for data exchange and federation across domains. Fluree Sense can feed directly into Fluree Core, allowing your business domains to directly collaborate with your enterprise-wide data ecosystem.
** Special Thanks to Eliud Polanco for Co-Authoring this post