We’ve all heard of machine learning, the subset of artificial intelligence that makes inferences based on patterns in data. And if you’re involved in big data analytics or enterprise data management, you’ve probably also discovered the potential benefits of applying machine learning to automate parts of the data management process. In the context of master data management, we are starting to see more use of machine learning to clean, de-duplicate, and “master” enterprise data. Today, we’ll cover the benefits of supervised machine learning, which learns from labeled training sets before the algorithm is unleashed on new data. We’ll also dive into the benefits of using human subject matter experts to provide additional feedback and training to machine learning algorithms.
Supervised machine learning refers to models trained on labeled data, meaning the data used to train the model includes the desired output, or target variable. Training is further reinforced by continuous feedback (think of “likes” or the “thumbs up” button). The goal of supervised machine learning is to discover patterns in the data it was trained on, generalize those patterns, and then use the generalizations to make predictions on new, unseen or unlabeled data.
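As a minimal illustration (with entirely hypothetical data), here is supervised learning reduced to its essence: a 1-nearest-neighbor classifier that stores labeled examples and then predicts a label for an unseen input by copying the label of the closest known example.

```python
# Supervised learning in miniature: "training" stores labeled examples,
# and prediction copies the label of the nearest one (1-nearest-neighbor).
# All data here is hypothetical.

def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(training_set, x):
    """Return the label of the labeled example closest to x."""
    _, label = min(training_set, key=lambda pair: distance(pair[0], x))
    return label

# Labeled training data: (features, label) pairs, e.g. similarity scores
# between two customer records and whether they were truly duplicates.
labeled = [
    ((0.10, 0.20), "not_duplicate"),
    ((0.90, 0.80), "duplicate"),
    ((0.20, 0.10), "not_duplicate"),
    ((0.95, 0.90), "duplicate"),
]

print(predict(labeled, (0.85, 0.75)))  # an unseen record pair
```

Real systems use far richer models, but the shape is the same: labeled inputs in, generalizations out, predictions on data the model has never seen.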
Supervised machine learning can be an effective approach for improving the efficiency and accuracy of master data management. The traditional method of Master Data Management involves (1) taking data from multiple data sources as inputs, (2) conforming them to the MDM system’s native data model for processing, (3) building a set of “fuzzy logic” rules for matching together entities that belong together, and then (4) defining “survivorship” rules for merging the data from the source systems into one Golden Record for each entity.
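The four steps above can be sketched in Python; the records, field names, and thresholds below are purely illustrative, not a real MDM system’s API.

```python
# Toy version of the four-step rules-based pipeline described above.
# Source records, field names, and thresholds are illustrative only.
from difflib import SequenceMatcher

# (1) Input records from two hypothetical sources,
# (2) already conformed to a single shared data model.
source_a = {"name": "Acme Inc.", "city": "Denver", "updated": 2021}
source_b = {"name": "ACME Incorporated", "city": "Denver", "updated": 2023}

def similarity(x, y):
    """Fuzzy string similarity in [0, 1]."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

# (3) A hand-written "fuzzy logic" matching rule.
def is_match(r1, r2, name_threshold=0.6):
    return (similarity(r1["name"], r2["name"]) >= name_threshold
            and r1["city"] == r2["city"])

# (4) A hand-written survivorship rule: the most recently
# updated source wins.
def golden_record(records):
    return max(records, key=lambda r: r["updated"])

if is_match(source_a, source_b):
    print("Golden record:", golden_record([source_a, source_b]))
```

Note that every rule here, from the 0.6 name threshold to “newest record wins,” had to be chosen by a human in advance, which is exactly the burden discussed next.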
These traditional methods of conforming the source data, defining the matching rules, and then defining the survivorship rules can be time-consuming, as they are often discovered through trial and error and lots of experimentation. They are also prone to failure, as they cannot necessarily account for variability or significant changes in the data source inputs. Supervised machine learning, on the other hand, uses training data to recognize patterns and discover the logic for conforming, matching, and merging the data, without needing analysts and engineers to hand-author the rules in advance.
In the context of Master Data Management, supervised machine learning can be used to (1) identify records that belong to the same entity type; (2) cluster records and identify matches between them; and (3) determine which data values should survive when creating the golden record.
For example, by providing some samples of labeled data of one entity type – say an Organization – machine learning-based data classification models can scan data from a variety of data sources and look for other tables and records of the same Organization type. Then, other machine learning algorithms can learn under what conditions records of the same entity should be matched together. Is it enough that two records have the same name? What if the name has changed but the addresses are the same? What if one record has a company named “Acme Inc.” and another record has a company named “Acme LLC”? Are they the same?
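Questions like the “Acme Inc.” vs. “Acme LLC” one can be made concrete by scoring name similarity. The snippet below uses Python’s standard-library difflib as a stand-in for the richer measures (Jaro-Winkler, token-based, and so on) that real matching engines typically use; the company names are hypothetical.

```python
# Quantifying how "fuzzy" a name match is with a string-similarity score.
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("Acme Inc.", "Acme LLC"),
    ("Acme Inc.", "Acme Incorporated"),
    ("Acme Inc.", "Apex Industries"),
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {name_similarity(a, b):.2f}")
```

A score alone doesn’t answer “are they the same entity?”; it only turns the question into a number that a rule, or a trained model, can then decide on.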
Trying to work out all of the possible permutations of rules to discover when something is or isn’t a match can take forever, but for machine learning models that is quite easy to do. And, depending on the amount of training provided, the resulting models can be more accurate and efficient compared to hand-written rules.
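For instance, rather than hand-picking a similarity cutoff, a supervised approach can learn one from labeled pairs. This is a deliberately simplified sketch: the training pairs are hypothetical and the “learning” is a brute-force search over candidate thresholds, standing in for a real model’s training procedure.

```python
# Learning a match cutoff from labeled pairs instead of hand-tuning it.
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Labeled training pairs: (name_1, name_2, is_same_entity)
labeled_pairs = [
    ("Acme Inc.", "Acme Incorporated", True),
    ("Acme Inc.", "Acme LLC", True),
    ("Acme Inc.", "Apex Industries", False),
    ("Globex Corp", "Globex Corporation", True),
    ("Globex Corp", "Initech", False),
]

def learn_threshold(pairs, candidates=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """Pick the candidate cutoff that makes the most correct decisions."""
    def correct(t):
        return sum((sim(a, b) >= t) == label for a, b, label in pairs)
    return max(candidates, key=correct)

print("learned threshold:", learn_threshold(labeled_pairs))
```

The more labeled pairs you feed in, the better the learned decision boundary reflects how your data actually behaves, which is the point of the paragraph above.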
Supervised machine learning can bring many benefits to master data management (MDM) by improving the speed, scalability, and accuracy of the data-mastering process.
Firstly, supervised machine learning can significantly speed up the process of getting an MDM program initiated. Most MDM projects take weeks just to get started, as developers first have to do substantial ETL and data engineering to conform the source data to the MDM system’s native data model. Then, once the data is loaded into the MDM system, developers can begin programming the matching and survivorship rules in collaboration with business analysts. After iterative tuning and manual recalibration of the rules (e.g., “match the records together if the names match by 70% instead of 75%”), records across common entities can be clustered together to start generating Golden Records. This process may take months before you see any results. By reducing the need for manual ETL and rule-tuning efforts, machine learning eliminates the most time-consuming part of data mastering and can start generating results within days or, at worst, weeks.
Secondly, supervised machine learning allows for greater scalability when dealing with both large amounts and a large variety of data sources. A traditional rules-based approach can be effective for a reasonable number of datasets, but when the volume or variety of data sources being mastered changes, the rules authored before may no longer apply. Each new data source would then require its own data transformation, matching-rule definition, and survivorship-rule definition. By contrast, machine learning processes get smarter the more data is fed into them. Instead of time and effort scaling linearly with each new data source, as under a traditional rules-based MDM approach, the incremental effort for each new source shrinks dramatically under a machine learning-based approach.
Lastly, supervised machine learning improves accuracy in data mastering. A rules-based data-mastering solution can only match records that fit the exact conditions of the rules, and developers have to define those conditions precisely (e.g., “these two entities can be matched if the names match by 50% and the addresses match by 60%”). In practice, most rules only achieve 80-90% of their intended accuracy, and with very large data volumes even a 5% error rate can be a big deal! With machine learning, patterns with much richer features than hand-written rules can be discovered, improving accuracy to as much as 98-99%. This improves trust in data and enables informed business decisions.
Let’s explore a scenario in the financial services industry where supervised machine learning can significantly reduce overhead and time-to-value in data mastering. In this scenario, a bank needs to maintain a master list of all its customers, which includes information such as name, address, phone number, and account information. The bank receives customer data from multiple sources, including online applications, branch visits, and third-party providers.
Without machine learning, the bank would have to manually review and cleanse the data, which would be a time-consuming and costly process. The bank would have to manually identify duplicates, correct errors, and standardize the data. This process would require a large number of data analysts and would take several months to complete.
However, by using supervised machine learning, the bank can automate the data cleansing process. The bank can provide the machine learning model with a set of labeled data, which includes examples of correct and incorrect data. The model can then learn to identify patterns in the data and make predictions about the data it has not seen before. The bank can also use the model to identify and merge duplicate records, correct errors, and standardize the data.
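A sketch of what the bank’s duplicate-detection step might look like: a tiny logistic regression trained from scratch on labeled record pairs. The feature values are hypothetical, and a production system would use a mature machine learning library rather than this hand-rolled trainer.

```python
# Tiny logistic regression for duplicate detection, trained on
# hypothetical labeled record pairs.
import math

# Each example: ([name_similarity, address_similarity], is_duplicate)
training = [
    ([0.95, 0.90], 1), ([0.88, 0.97], 1), ([0.91, 0.20], 1),
    ([0.30, 0.85], 0), ([0.15, 0.10], 0), ([0.40, 0.35], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.5, epochs=2000):
    """Stochastic gradient descent on the log-loss."""
    w, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in data:
            pred = sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + bias)
            err = pred - label
            w = [wi - lr * err * xi for wi, xi in zip(w, features)]
            bias -= lr * err
    return w, bias

def is_duplicate(w, bias, features):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + bias) >= 0.5

w, bias = train(training)
print(is_duplicate(w, bias, [0.92, 0.88]))
print(is_duplicate(w, bias, [0.20, 0.15]))
```

Once trained, the same model scores every incoming record pair automatically, which is what replaces the months of manual review described above.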
In this scenario, supervised machine learning can significantly reduce the cost and effort required to maintain a master list of customers. The bank can reduce the number of data analysts required and complete the data cleansing process in a fraction of the time it would take using traditional methods. This allows the bank to focus on more important tasks and make better use of its resources.
Pairing supervised machine learning with a continuous process of “crowdsourcing” feedback from human subject matter experts can bring additional benefits to data mastering. Subject matter experts have a deep understanding of the data and the business context in which it is used. They can provide valuable insights into the data and help identify patterns and relationships that may be difficult for the machine-learning model to detect on its own.
One of the main benefits of pairing machine learning with subject matter experts is the ability to improve the accuracy of the model. Subject matter experts can provide the machine learning model with labeled data that is accurate and representative of real-world data, helping the model learn to make more accurate predictions and reducing the number of errors. They can also identify and correct errors made by the model during the training process, further improving its accuracy.
Another benefit of pairing machine learning with subject matter experts is the ability to improve the interpretability of the model. Machine learning models can be opaque and difficult to understand, which can make it challenging to trust the results and take action on them. Subject matter experts can help explain the model’s predictions and provide context to the results, which increases the transparency and interpretability of the model.
Finally, involving subject matter experts in the training process helps to improve the acceptance and adoption of the machine learning solution among the users. Subject matter experts can act as a bridge between the data science team and the end users, communicating the value of the solution and addressing any concerns that may arise. It also allows experts to set the right expectations and ensure the solution addresses real business needs.
Fluree Sense uses supervised machine learning trained by subject matter experts to ingest, classify, and remediate disparate legacy data, making it a strong fit for getting your existing legacy data into shape. It automatically identifies and corrects errors, merges duplicate records, and standardizes data. The machine learning model is trained by human subject matter experts who provide labeled data that is accurate and representative of real-world data, ensuring the model makes accurate predictions and reduces errors.
Fluree Sense offers an interactive interface that allows for easy monitoring of the data cleansing process and enables real-time feedback and adjustments by human subject matter experts. This keeps the data cleansing process fully transparent and interpretable, which increases trust in the data and enables informed business decisions. Don’t wait to transform your data. Start your Fluree Sense journey here!