We’ve all heard of machine learning – a subset of Artificial Intelligence that can make inferences based on patterns in data. And if you’re involved in big data analytics or enterprise data management, you’ve also probably discovered the potential benefits of applying machine learning to automate certain components of the data management process. In the context of master data management, we are starting to see more use of machine learning to clean, de-duplicate, and “master” enterprise data. Today, we’ll cover the benefits of supervised machine learning with training sets before unleashing the algorithm on new data. We’ll also dive into the benefits of using human subject matter experts to apply additional feedback and training to machine learning algorithms.
Supervised machine learning trains a model on labeled data, meaning that the data used to train the model includes the desired output or target variable. It is further reinforced by continuous feedback (think of “likes” or the “thumbs up” button). The goal of supervised machine learning is to discover patterns in the input data it was trained on, generalize those patterns, and then use the generalizations to make predictions on new, unseen, or unlabeled data in the future.
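To make this concrete, here is a minimal, illustrative sketch of the idea in Python (standard library only; the records and the single-feature "model" are invented for illustration, and real MDM models use far richer features): a match/no-match classifier learns a name-similarity threshold from labeled record pairs, then predicts on pairs it has never seen.

```python
# Toy supervised learning for record matching: learn a similarity
# threshold from labeled pairs, then classify unseen pairs.
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Labeled training data: (record A, record B, is_match)
training = [
    ("Acme Inc", "Acme Incorporated", True),
    ("Acme Inc", "Apex Industries", False),
    ("Globex Corp", "Globex Corporation", True),
    ("Globex Corp", "Initech LLC", False),
]

def fit_threshold(pairs):
    """Pick the cutoff that classifies the most training pairs correctly."""
    candidates = [similarity(a, b) for a, b, _ in pairs]
    best, best_correct = 0.5, -1
    for t in candidates:
        correct = sum((similarity(a, b) >= t) == label for a, b, label in pairs)
        if correct > best_correct:
            best, best_correct = t, correct
    return best

threshold = fit_threshold(training)

def predict(a, b):
    """Predict match/no-match for a pair never seen in training."""
    return similarity(a, b) >= threshold

print(predict("Acme Inc", "Acme Inc."))          # True
print(predict("Acme Inc", "Umbrella Holdings"))  # False
```

The "training" here is trivially small, but the shape is the point: the logic for matching is learned from labeled examples rather than hand-authored in advance.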
Supervised machine learning can be an effective approach for improving the efficiency and accuracy of master data management. The traditional method of Master Data Management involves (1) taking data from multiple data sources as inputs, (2) conforming it to the MDM system’s native data model for processing, (3) building a set of “fuzzy logic” rules for matching entities that belong together, and then (4) defining “survivorship” rules for merging the data from the source systems into one Golden Record for each entity.
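As a toy illustration of step (4), here is one common survivorship policy, "most recent non-null value wins," sketched in Python (the records, source names, and fields are hypothetical):

```python
# Survivorship sketch: for each field, the value from the most
# recently updated source that actually has a value survives.
from datetime import date

# Three source-system records for the same customer entity
records = [
    {"source": "CRM",     "updated": date(2022, 6, 1),
     "name": "Acme Inc.", "phone": None, "city": "Denver"},
    {"source": "Billing", "updated": date(2023, 1, 15),
     "name": "Acme Incorporated", "phone": "555-0100", "city": None},
    {"source": "Support", "updated": date(2021, 3, 9),
     "name": "ACME", "phone": "555-0199", "city": "Denver"},
]

def golden_record(records, fields):
    """Merge source records: newest non-null value survives per field."""
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for field in fields:
        golden[field] = next(
            (r[field] for r in newest_first if r[field] is not None), None)
    return golden

print(golden_record(records, ["name", "phone", "city"]))
# {'name': 'Acme Incorporated', 'phone': '555-0100', 'city': 'Denver'}
```

Note how "city" falls back to an older source because the newest record has no value for it; real survivorship rules layer many more such policies per field.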
These traditional methods of conforming the source data, defining the matching rules, and then defining the survivorship rules can be time-consuming as they are often discovered through trial-and-error and lots of experimentation. They are also prone to failure as they cannot necessarily account for variability or significant changes to the data source inputs. Supervised machine learning, on the other hand, uses training data to recognize patterns and discover the logic for conforming, matching and merging the data – without needing analysts and engineers to manually hand-author the rules in advance.
In the context of Master Data Management, supervised machine learning can be used to (1) identify records which are of the same entity type; (2) cluster records and identify matches between them; and (3) figure out which data values to survive when creating the golden record.
For example, by providing some samples of labeled data of one entity type – say an Organization – machine learning-based data classification models can scan data from a variety of data sources and look for other tables and records of a similar Organization type. Then, other machine learning algorithms can learn under what conditions records of the same entity should be matched together. Is it if two entities have the same name? What if the name has changed but the addresses are the same? What if the record for one company has the name “Acme Inc.” but another record has a company with the name of “Acme LLC”? Are they the same?
Trying to work out all of the possible permutations of rules to discover when something is or isn’t a match can take forever, but for machine learning models that is quite easy to do. And, depending on the amount of training provided, the resulting models can be more accurate and efficient compared to hand-written rules.
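To see why hand-authoring these rules gets tedious, here is just one such rule, "ignore legal suffixes when comparing names," sketched in Python (the suffix list is illustrative and far from complete); every "Acme Inc." vs. "Acme LLC" edge case demands another rule like it:

```python
# One hand-written matching rule, to illustrate how the rules-based
# approach grows: strip punctuation and legal suffixes before comparing.
import re

LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd",
                  "corp", "corporation", "co"}

def normalize(name):
    """Lowercase, drop punctuation, drop legal-entity suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def same_company(a, b):
    return normalize(a) == normalize(b)

print(same_company("Acme Inc.", "Acme LLC"))   # True
print(same_company("Acme Inc.", "Apex Inc."))  # False
```

This one rule already needs a curated suffix list; renamed companies, shared addresses, abbreviations, and typos would each need rules of their own, which is exactly the permutation explosion a learned model avoids.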
Supervised machine learning can bring many benefits to master data management (MDM) by improving speed, scalability and accuracy of the data-mastering process.
Firstly, supervised machine learning can significantly speed up the process of getting an MDM program initiated. Most MDM projects take weeks just to get started, as Developers first have to do extensive ETL and data engineering to conform the source data to the MDM system’s native data model. Then, once the data is loaded into the MDM system, the Developers can begin programming the matching and survivorship rules, in collaboration with Business Analysts. After iterative tuning and manually recalibrating the rules (e.g., “match the records together if the names match by 70% instead of 75%”), records across common entities can be clustered together to start generating Golden Records. This process may take months before you see any results. By reducing the need for manual ETL and rule-tuning efforts, machine learning eliminates the most time-consuming part of data mastering and can start generating results within days or, at worst, weeks.
Secondly, supervised machine learning allows for greater scalability when dealing with both large amounts and a large variety of data sources. A traditional rules-based approach can be effective for a reasonable number of datasets, but when the volume or variety of data sources being mastered changes, the rules authored before may no longer apply. Each new data source would then require its own data transformation, rule definition for matching, and rule definition for survivorship. By contrast, machine learning processes get smarter the more data is fed into them. Instead of time and effort scaling linearly with each new data source, as under a traditional rules-based MDM approach, the effort required per new source keeps shrinking in a machine learning-based approach as the models accumulate training data.
Lastly, supervised machine learning improves accuracy in data mastering. A rules-based data mastering solution can only match records that fit the exact conditions of the rules, and developers have to define the precise conditions of the rules (e.g., “these two entities can be matched if the names match by 50%, and the addresses match by 60%”). However, rules typically only capture 80-90% of cases as defined; at very large data volumes, even a 5% error rate can be a big deal! With machine learning, patterns with much richer features than hand-written rules can be discovered, improving accuracy to as high as 98-99%. This improves trust in data and enables informed business decisions.
Let’s explore a scenario in the financial services industry where supervised machine learning can significantly reduce overhead and time-to-value in data mastering. In this scenario, a bank needs to maintain a master list of all its customers, which includes information such as name, address, phone number, and account information. The bank receives customer data from multiple sources, including online applications, branch visits, and third-party providers.
Without machine learning, the bank would have to manually review and cleanse the data, which would be a time-consuming and costly process. The bank would have to manually identify duplicates, correct errors, and standardize the data. This process would require a large number of data analysts and would take several months to complete.
However, by using supervised machine learning, the bank can automate the data cleansing process. The bank can provide the machine learning model with a set of labeled data, which includes examples of correct and incorrect data. The model can then learn to identify patterns in the data and make predictions about the data it has not seen before. The bank can also use the model to identify and merge duplicate records, correct errors, and standardize the data.
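Once a model has predicted which record pairs match, duplicates still have to be merged transitively: if A matches B and B matches C, all three are one customer. A standard way to do this is union-find; here is a minimal sketch (the record IDs and match pairs are invented):

```python
# Cluster records into entities from pairwise match predictions,
# using union-find so transitive matches land in one cluster.
def cluster(n_records, matched_pairs):
    parent = list(range(n_records))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # union the two clusters

    groups = {}
    for i in range(n_records):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Records 0, 1, 2 are the same customer via two pairwise matches; 3 is distinct.
print(cluster(4, [(0, 1), (1, 2)]))  # [[0, 1, 2], [3]]
```

Each resulting cluster then feeds the survivorship step that produces one Golden Record per entity.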
In this scenario, supervised machine learning can significantly reduce the cost and effort required to maintain a master list of customers. The bank can reduce the number of data analysts required and complete the data cleansing process in a fraction of the time it would take using traditional methods. This allows the bank to focus on more important tasks and make better use of its resources.
Pairing supervised machine learning with a continuous process of “crowdsourcing” feedback from human subject matter experts can bring additional benefits to data mastering. Subject matter experts have a deep understanding of the data and the business context in which it is used. They can provide valuable insights into the data and help identify patterns and relationships that may be difficult for the machine-learning model to detect on its own.
One of the main benefits of pairing machine learning with subject matter experts is the ability to improve the accuracy of the model. Subject matter experts can provide the machine learning model with labeled data that is accurate and representative of real-world data. This can help the model learn to make more accurate predictions and reduce the number of errors. Additionally, subject matter experts can also help identify and correct errors made by the model during the training process, which further improves the accuracy of the model.
Another benefit of pairing machine learning with subject matter experts is the ability to improve the interpretability of the model. Machine learning models can be opaque and difficult to understand, which can make it challenging to trust the results and take action on them. Subject matter experts can help explain the model’s predictions and provide context to the results, which increases the transparency and interpretability of the model.
Finally, involving subject matter experts in the training process helps to improve the acceptance and adoption of the machine learning solution among the users. Subject matter experts can act as a bridge between the data science team and the end users, communicating the value of the solution and addressing any concerns that may arise. It also allows experts to set the right expectations and ensure the solution addresses real business needs.
Fluree Sense uses supervised machine learning trained by subject matter experts to ingest, classify, and remediate disparate legacy data. Fluree Sense is perfect for getting your existing legacy data into shape. Fluree Sense uses supervised machine learning to automatically identify and correct errors, merge duplicate records, and standardize data. The machine learning model is trained by human subject matter experts who provide labeled data that is accurate and representative of real-world data. This ensures that the model is able to make accurate predictions and reduce errors.
Fluree Sense offers an interactive interface that allows for easy monitoring of the data cleansing process and allows for real-time feedback and adjustments by human subject matter experts. This ensures that the data cleansing process is fully transparent and interpretable, which increases trust in the data and enables informed business decisions. Don’t wait to transform your data, start your Fluree Sense journey here!
Simply put, dirty data is information that is incorrect, outdated, inconsistent, or incomplete. Dirty data may take the simple form of a spelling error or an invalid date, or a complex one such as inconsistency because some systems are reporting one set of facts that may have been updated more recently somewhere else. In all cases, dirty data can have massive negative consequences on daily operations or data analytics. Without good data, how are we supposed to make good decisions?
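Each of the categories above (incomplete, incorrect, and inconsistent data) can be caught with simple checks. A toy sketch in Python, with invented field names and rules:

```python
# Illustrative dirty-data checks: missing values, invalid dates,
# and cross-system inconsistency.
from datetime import datetime

def check_record(record):
    """Return a list of data-quality issues found in one record."""
    issues = []
    # Incomplete: required field missing or empty
    if not record.get("email"):
        issues.append("missing email")
    # Incorrect: a date that doesn't parse (e.g., Feb 30)
    try:
        datetime.strptime(record.get("signup_date", ""), "%Y-%m-%d")
    except ValueError:
        issues.append("invalid signup_date")
    return issues

def check_consistency(system_a, system_b, field):
    """Flag entities whose field disagrees between two systems."""
    return [k for k in system_a.keys() & system_b.keys()
            if system_a[k][field] != system_b[k][field]]

rec = {"email": "", "signup_date": "2023-02-30"}  # Feb 30 doesn't exist
print(check_record(rec))  # ['missing email', 'invalid signup_date']

crm  = {"cust-1": {"city": "Denver"}}
bill = {"cust-1": {"city": "Boulder"}}
print(check_consistency(crm, bill, "city"))  # ['cust-1']
```

The hard part in practice is not writing any one check but deciding, at scale, which conflicting value is the truth; that is where the mastering techniques above come in.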
How common is dirty data? The short answer is: very common. In fact, it’s systemic. Experian recently reported that, on average, U.S. organizations believe 32 percent of their data is inaccurate. The correlated impact is equally staggering: in the U.S. alone, bad data costs businesses $3 trillion per year.
And yet, most organizations believe that data is a strategic asset and that extracting the value out of data is one of the most strategic business imperatives today. So, how did we get here where something so valuable is still so hard to clean and maintain?
The reason that Dirty Data is systemic is that its creation is baked into today’s business-as-usual processes for how data is being used. And this is because most companies’ relationship with data has evolved, but the data management processes and technologies have not kept up with the change.
Data was originally the by-product of a business function, such as Sales or Order Fulfillment. Each business function focused on optimizing its business processes through digitization, process automation, and software. Business applications produced the data that were needed to operate a process, in a data model that was best understood by the application. In the beginning, there may have been some manual data entry errors here or there, but by and large most business functions maintained the data integrity and quality at the levels required for it to successfully execute its operations.
However, with the introduction of the Internet and Big Data over 20 years ago, we learned that data has extraordinary intrinsic value beyond operating individual business functions. When data from across functions, product lines, and geographies was correlated, combined, mixed, and merged along with other external data, it became possible to generate new insights that could lead to innovations in customer experience, product design, pricing, sales, and service delivery. Data analytics emerged as the hot buzzword and companies began investing in their analytic infrastructure. But there were a few inherent problems with how data analytics was being done.
At its core – Dirty Data was the unintended consequence of two factors:
1. To be of maximum value, data needs to be used beyond its original intended purpose, by actors in an enterprise outside of the originating business function.
2. Each time we use data, we are forced to copy and transform it. The more we copy and transform, the more we create opportunities for data to lose integrity and quality.
Many organizations have begun adopting Data Governance processes to grapple with the viral growth and transmission of Dirty Data. These include processes for managing which sources should be authoritative for each type of data; collecting and managing metadata (information captured across the data lifecycle, such as what the data describes, who created it, and where it comes from); and implementing data quality measurement and remediation processes.
However, in order to prioritize, most data governance processes depend on first identifying Critical Data Elements (or CDEs), the data attributes that are most important to certain business functions, and then wrapping data policy and control around those. While it makes sense to focus first on the key data features that are most useful, this approach defines what is critical based only on what we know today. As we’ve seen, what is not considered critical today could become the most valuable asset in the future. This means that we need to evolve beyond data governance through CDEs if we truly want to eliminate Dirty Data from existence.
First, we must enhance and accelerate the data governance activities that have been proven to clean existing dirty data. In other words, stop the bleeding. But then, we need to address Dirty Data at its root by rebuilding the underlying data infrastructure to do away with the need to copy and transform data in order for it to be used across functions. Let’s explore each step further:
1. Move Data Literacy and Data Management to the Front (to the Business)
One of the big challenges in most data governance processes is that they are considered IT problems. IT Developers engineered transformation scripts and data pipelines to copy and move data all over the place, and IT Developers are being asked to create more scripts and build new pipelines to remediate data quality issues. However, most IT Developers don’t actually understand the full context as to how the data is being used, or what the terms really mean. Only the individuals in the Front of the house, who build products or interact with customers, know what the data means for sure.
Business domain subject experts can quickly tell just by looking at it whether certain data is clean or dirty, based on institutional knowledge built over years of effectively doing their jobs. The only challenge is that this knowledge is stuck in their heads. Governance processes that can effectively institutionalize business subject matter expert knowledge are the ones that will be most successful. This can be accomplished by making data literacy and data accountability part of the core business culture, implementing data management procedures with business domain experts in key roles, and supplementing this with techniques and technologies such as Knowledge Graphs and crowdsourcing, that can collect and store business knowledge in a highly collaborative and reusable fashion.
2. Standardize Data Using a Universal Vocabulary
Because we need to transform data for each use case based on that use case’s target consumption model, another way to stop the bleeding is to start building a reusable and comprehensive vocabulary or business data glossary that is designed to address most use case needs, now and as best as we can anticipate the future. Creating one consistent canonical semantic model inside an organization reduces the need for continuously creating new consumption models and then creating new ETL scripts and pipelines.
Organizations can accomplish this by building a framework for semantic interoperability – using standards to define information and relationships under a universal vocabulary. Read more on why semantic interoperability matters here.
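As a toy sketch of what a canonical model buys you: each source schema is mapped once to the universal vocabulary, and every consumer reads the canonical form instead of writing per-source ETL (all field names below are invented for illustration):

```python
# Map per-source schemas onto one canonical vocabulary, once.
CANONICAL_MAPPINGS = {
    "crm":     {"cust_nm": "customerName", "tel": "phoneNumber"},
    "billing": {"CUSTOMER": "customerName", "PHONE_NO": "phoneNumber"},
}

def to_canonical(source, record):
    """Rename a source record's fields into the universal vocabulary."""
    mapping = CANONICAL_MAPPINGS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

print(to_canonical("crm", {"cust_nm": "Acme Inc.", "tel": "555-0100"}))
print(to_canonical("billing", {"CUSTOMER": "Acme Inc.", "PHONE_NO": "555-0100"}))
# Both yield {'customerName': 'Acme Inc.', 'phoneNumber': '555-0100'}
```

Adding a new source means adding one mapping entry, not a new pipeline for every downstream consumer.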
3. Apply Machine Learning Where it Makes Sense
Let’s face it – most data-cleaning projects can snowball into massive undertakings, especially with increased data complexity or obscurities that simply cannot be solved manually at scale.
Machine learning can outpace the standard ETL pipeline process by applying an algorithmic process to data cleansing and enrichment. A machine-learning approach to data transformation embraces complexity and will get better and faster over time. With machine learning, we can break past the subjective limitations of CDEs and treat any and all data attributes as if they are important. Pairing crowdsourced knowledge from business Subject Matter Experts (#1) with machine learning generates the most accurate outcomes the fastest.
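One common way to combine the two, sketched below with invented record IDs and confidence scores: predictions the model is confident about are auto-applied, while low-confidence ones are routed to a subject-matter-expert review queue whose answers become new training labels.

```python
# Human-in-the-loop routing: confident predictions are auto-applied;
# uncertain ones go to an SME review queue. Threshold is illustrative.
def route(predictions, auto_threshold=0.9):
    """Split (pair, confidence) predictions into auto-apply and review lists."""
    auto, review = [], []
    for pair, confidence in predictions:
        (auto if confidence >= auto_threshold else review).append(pair)
    return auto, review

preds = [(("rec-1", "rec-2"), 0.97),
         (("rec-3", "rec-4"), 0.62),
         (("rec-5", "rec-6"), 0.91)]
auto, review = route(preds)
print(auto)    # [('rec-1', 'rec-2'), ('rec-5', 'rec-6')]
print(review)  # [('rec-3', 'rec-4')]
```

The review queue is where institutional knowledge gets captured: each SME decision is a fresh labeled example that retrains and sharpens the model.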
To get rid of dirty data for good, you must take a step back and apply transformation at the data infrastructure level.
Too many data governance and quality investments focus on cleaning already dirty data, while the very sources and processes that create the environment for dirty data remain untouched. Much like the difference between preventative healthcare and treatment-centered healthcare, the opportunity costs and business risks associated with being reactive to dirty data rather than proactive in changing the environment for data are enormous.
To proactively address dirty data at its root, we need to go back to the core architecture for how data is created and stored: the Application-centric architecture. In an Application-centric architecture, data is stored as the third tier of the application stack, in the application’s native vocabulary. Data is a by-product of the application and is inherently siloed.
Emerging new data-centric architectures flip the equation by putting data in the middle, and bringing the business functions to the data rather than copying data and moving it to the function. This new design pattern acknowledges data’s valuable and versatile role in the larger enterprise and industry ecosystem and treats information as the core asset to enterprise architectures.
Take a step back from building more data-cleaning silos and pipelines, and look to best practices in data-centric architectures that produce future-proofed “golden records” out of the gate that can be applied to any amount of operational or analytical applications.
There are foundational concepts to data-centric architecture that must be embraced to truly uproot dirty data (and their silos) for good – Security, Trust, Interoperability, and Time – that allow data to become instantly understandable and virtualized across domains.
Fluree’s data-centric suite of tools provides organizations with a path forward for better, cleaner data management:
Fluree Sense is the first step in your data transformation journey: it uses machine learning trained by subject matter experts to ingest, classify, and remediate disparate legacy data, and is perfect for getting your existing legacy data into shape.

Fluree Core is your second step. Fluree Core provides a data management platform that enables true data-centric collaboration between any number of data sources and data consumers. Fluree’s core data platform is a graph database that leverages distributed ledger technology for trust and data lineage, data-centric security for data governance that scales, and semantic interoperability for data exchange and federation across domains. Fluree Sense can feed directly into Fluree Core, allowing your business domains to directly collaborate with your enterprise-wide data ecosystem.
** Special Thanks to Eliud Polanco for Co-Authoring this post