Data Management 5: Master Data Management
This is the fifth blog post in the Data Management series. We have covered several basic topics so far. This time we zoom in on one of the subdisciplines of Data Management: Master Data Management.
Loosely defined – and following Loshin’s excellent book on MDM – Master Data are those core Entities often used in different applications across the enterprise, along with their associated metadata, attributes, taxonomies and so on. Master data management (MDM), then, is control over master data to enable consistent, shared, contextual use across systems of the most accurate, timely and relevant version of truth about essential business entities.
In order to achieve such ambitious goals, MDM requires identifying and/or developing a ‘golden’ record of truth for each of those key Entities (e.g., “product”, “place”, “person”, or “organization”). In some cases, a ‘system of record’ provides the definitive data about an instance. But even then, this single system may accidentally produce more than one record about the same instance, which is an obvious deviation from the ideal situation of a single golden record per key entity.
MDM is often seen as a technical discipline because a lot of (complex) data manipulation is involved. For example, two systems may both store Customer Data:
- In one system the name of a customer is exactly one field in the database, whereas the other system has fields for first name and last name.
- In the first system, customers are identified (primary key) by a social security number; in the other they have a unique customer ID that is generated for each new customer (data quality question: how do you make sure the same customer does not exist twice, with two separate customer IDs?)
- Both systems have an address for customers. However, are we 100% sure that both have the same meaning? One could be the shipping address, whereas the other is the billing address!
These – and many more – data issues can be resolved by creating a golden record of an Entity. The data manipulation that was mentioned earlier includes such things as profiling data from both systems to figure out which system has the highest data quality, whether the semantics of addresses are the same, and so on. In addition, data has to be transported to the system that will hold the golden record. It may have to be transformed (data formats, integrity rules), metadata has to be stored, and so on. The information system that takes care of this data manipulation and supports the data steward in their workflow is called an MDM system.
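To make the customer example above concrete, here is a minimal sketch of how two source records could be merged into a golden record. Everything in it is an assumption for illustration: the system names (`crm`, `shop`), the choice to match records on social security number, the naive name-splitting rule, the rule that the split name fields win, and the reading of one address as billing and the other as shipping. A real MDM system would derive such rules from data profiling and from agreements with the data steward.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CrmCustomer:
    """Hypothetical system 1: a single name field, keyed by social security number."""
    ssn: str
    name: str
    address: str  # assumed to mean the billing address


@dataclass
class ShopCustomer:
    """Hypothetical system 2: split name fields, keyed by a generated customer id."""
    customer_id: int
    ssn: str
    first_name: str
    last_name: str
    address: str  # assumed to mean the shipping address


@dataclass
class GoldenCustomer:
    ssn: str
    first_name: str
    last_name: str
    billing_address: str
    shipping_address: Optional[str] = None
    # Lineage metadata: golden field name -> "system.field" it was taken from.
    lineage: Dict[str, str] = field(default_factory=dict)


def build_golden(crm: CrmCustomer, shop: Optional[ShopCustomer] = None) -> GoldenCustomer:
    # Naive transformation rule: the last word of the name is the last name.
    first, _, last = crm.name.strip().rpartition(" ")
    golden = GoldenCustomer(
        ssn=crm.ssn,
        first_name=first,
        last_name=last,
        billing_address=crm.address,
        lineage={
            "ssn": "crm.ssn",
            "first_name": "crm.name",
            "last_name": "crm.name",
            "billing_address": "crm.address",
        },
    )
    # Matching rule (assumed): the same social security number means the same customer.
    if shop is not None and shop.ssn == crm.ssn:
        # Survivorship rule (assumed): the explicitly split name fields win.
        golden.first_name, golden.last_name = shop.first_name, shop.last_name
        golden.shipping_address = shop.address
        golden.lineage.update({
            "first_name": "shop.first_name",
            "last_name": "shop.last_name",
            "shipping_address": "shop.address",
        })
    return golden


golden = build_golden(
    CrmCustomer("123-45-6789", "Ada Lovelace", "1 Billing Street"),
    ShopCustomer(42, "123-45-6789", "Ada", "Lovelace", "9 Shipping Road"),
)
print(golden)
```

Keeping the `lineage` dictionary next to the data is one simple way to record "what came from where"; it is not meant to suggest how any particular MDM product stores its metadata.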
This is all true, and the technical work involved is indeed complex. However, the business part of this work is where it gets interesting: the Data Steward will have to work with various stakeholders in the organization to define which Entities should be mastered, what the requirements are with respect to this data (where is it used, what quality is required), and so on.
On the architecture side, it is also worth mentioning that there are many architectural patterns for MDM systems. One key decision is whether the MDM system actively manages operational data, or whether it holds a copy:
Fleshing out the details of these patterns is something for another blog post. Let us know if you’re interested in seeing example models that illustrate the main principles. Last but not least, the following table shows the main implementation styles for MDM systems:
From a modeling perspective, there are three key questions that need answering: (1) how to model a golden record, (2) how to model the lineage of data objects, and (3) how to link it to an MDM-system?
Golden Records are Data Objects and are therefore modeled using the DataObject concept. As such there is nothing special about them, except for the fact that their instances are of the highest quality and carry additional metadata (such as lineage, identifying fields for the original data stores, etc.). To make them easy to recognize on architecture views, the graphical shape for all Golden Records should stand out visually. For example:
The same goes for the MDM system: to distinguish it from other systems, the same crown-icon can be added to the top-right corner of the ApplicationComponent concept. It should be noted that MDM systems by various vendors have different functionality – obviously with some overlap. The functionality of such systems should be modeled (as is the case with any regular ApplicationComponent) using ApplicationFunctions.
More challenging is to model the lineage of Data Objects. The trick is that we want to represent how fields of several Data Objects are transformed into the Fields of the Golden Record. It goes too far to flesh out the actual transformation rules (see footnote 1), but we should at least be able to pinpoint “what goes where” (i.e., data placement, data movement, etc.).
The UsedBy relation may have the most natural feel to it, but has different semantics in ArchiMate; it is actually not allowed between two DataObjects. The Flow relation also has its merits, but modeling flow is also syntactically not allowed between DataObjects. We found that the use of (a profiled version of) the AggregationRelation between DataObjects works well in this context. The intended meaning of this relation is “information from the source DataObjects is aggregated into the target, Master DataObject”.
The profile ideally changes the graphical shape of the relation (dashed line, perhaps slightly thicker line width). This way it cannot be confused with a regular AggregationRelation. The profiled relation works at two levels:
- Between the Fields of the Golden Record and the Fields from DataObjects that it consists of
- Between the Golden Record and the DataObjects whose Fields it consists of (which is a derived relation)
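The derivation from the field level to the object level can be sketched in a few lines. The object and field names below are hypothetical, and the plain tuples merely stand in for profiled AggregationRelations in a modeling tool; this is not how any specific tool stores its model.

```python
# Field-level lineage: (source object, source field) -> (golden object, golden field).
# All names are invented for illustration.
FIELD_LINEAGE = [
    (("CRM Customer", "ssn"), ("Customer Golden Record", "ssn")),
    (("CRM Customer", "address"), ("Customer Golden Record", "billing_address")),
    (("Shop Customer", "first_name"), ("Customer Golden Record", "first_name")),
    (("Shop Customer", "last_name"), ("Customer Golden Record", "last_name")),
    (("Shop Customer", "address"), ("Customer Golden Record", "shipping_address")),
]


def derive_object_relations(field_lineage):
    """Derive object-level relations from the field-level ones.

    If at least one field of a source object feeds a field of a target object,
    the source object as a whole is related to the target object.
    """
    derived = set()
    for (source_object, _), (target_object, _) in field_lineage:
        if source_object != target_object:
            derived.add((source_object, target_object))
    return sorted(derived)


for source_object, target_object in derive_object_relations(FIELD_LINEAGE):
    print(f"{source_object} is aggregated into {target_object}")
```

Because the object-level relation is computed rather than drawn by hand, it cannot drift out of sync with the field-level model.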
The following diagram illustrates this:
Using these mechanisms with modern tooling such as BiZZdesign Architect, we should be able to generate a “tree view” that shows the entire lineage of an Entity or a Data Object. This is particularly useful for the Entity and the Golden Record. It will be fairly straightforward to automatically generate the following views:
- The Entity (BusinessObject) and the DataSteward (BusinessRole) who is responsible for it
- The Golden Record (DataObjects) that realizes it (RealizationRelation)
- For the Golden Record, show the DataObjects that it is made up of (AggregationRelation with a transformation profile)
- For all Data Objects, show the Fields (DataObjects) with a label view
- For all Data Objects, also include the associated Systems by drawing the ApplicationComponents and the AccessRelations
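As a rough illustration of how such a tree view could be generated from a model, the sketch below walks a toy relation list in the order of the steps above. The element names, the `(relation type, from, to)` tuple format, the direction convention and the `TRANSFORMATION` label are all assumptions made for this example; they do not reflect the API or storage format of BiZZdesign Architect or any other tool.

```python
TRANSFORMATION = "Aggregation<transformation>"  # stand-in for the profiled relation

# (relation type, from, to); the direction convention is an assumption of this sketch.
RELATIONS = [
    ("Assignment", "Data Steward", "Customer"),             # BusinessRole -> BusinessObject
    ("Realization", "Customer Golden Record", "Customer"),  # DataObject -> BusinessObject
    (TRANSFORMATION, "CRM Customer", "Customer Golden Record"),
    (TRANSFORMATION, "Shop Customer", "Customer Golden Record"),
    ("Access", "MDM System", "Customer Golden Record"),     # ApplicationComponent -> DataObject
    ("Access", "CRM", "CRM Customer"),
    ("Access", "Webshop", "Shop Customer"),
]

# Fields per Data Object, used for the "label view".
FIELDS = {
    "Customer Golden Record": ["ssn", "first_name", "last_name"],
    "CRM Customer": ["ssn", "name"],
    "Shop Customer": ["customer_id", "first_name", "last_name"],
}


def sources_of(relations, kind, target):
    """All elements that point at `target` with a relation of type `kind`."""
    return [src for rel_kind, src, tgt in relations if rel_kind == kind and tgt == target]


def describe(relations, fields, data_object, depth, seen):
    """Lines for one Data Object: its system(s), its fields, then its sources."""
    pad = "  " * depth
    systems = ", ".join(sources_of(relations, "Access", data_object)) or "unknown"
    lines = [f"{pad}{data_object} [system: {systems}]"]
    lines += [f"{pad}  - {name}" for name in fields.get(data_object, [])]
    for source in sources_of(relations, TRANSFORMATION, data_object):
        if source not in seen:  # guard against cycles in the model
            seen.add(source)
            lines += describe(relations, fields, source, depth + 1, seen)
    return lines


def lineage_tree(relations, fields, entity):
    """Tree view for an Entity: steward, Golden Record(s), and their lineage."""
    stewards = ", ".join(sources_of(relations, "Assignment", entity)) or "none"
    lines = [f"{entity} (steward: {stewards})"]
    for golden in sources_of(relations, "Realization", entity):
        lines += describe(relations, fields, golden, 1, {golden})
    return lines


print("\n".join(lineage_tree(RELATIONS, FIELDS, "Customer")))
```

The recursion means the same code also covers longer chains, where a source Data Object is itself assembled from other Data Objects.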
Along the same lines, the lineage can be visualized in a data movement type diagram:
- Select any Data Object and generate a view that shows in which System it is managed. Draw this ApplicationComponent on the canvas
- From the originally selected Data Object, follow the AggregationRelations with the transformation profile to find out if it is used to form another Data Object elsewhere in another System. Draw this other ApplicationComponent. Also draw a FlowRelation between the ApplicationComponents and label this relation with the name of the DataObject.
- Repeat this step until a Golden Record is found. Draw this Golden Record (DataObject) in the last ApplicationComponent in the chain of Systems.
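The steps above amount to a simple chain traversal, sketched below. The three dictionaries and sets are hypothetical stand-ins for what a modeling tool would look up in the repository, and the sketch makes the simplifying assumption that each Data Object feeds at most one other Data Object.

```python
# Hypothetical model content: which System manages which Data Object ...
MANAGED_IN = {
    "Shop Customer": "Webshop",
    "Staged Customer": "Staging Area",
    "Customer Golden Record": "MDM System",
}
# ... and which Data Object is transformed into which (profiled relation).
FEEDS = {
    "Shop Customer": "Staged Customer",
    "Staged Customer": "Customer Golden Record",
}
GOLDEN_RECORDS = {"Customer Golden Record"}


def data_movement(managed_in, feeds, golden_records, start):
    """Follow the lineage from `start` until a Golden Record is reached.

    Returns the FlowRelations to draw, as (from system, to system, label)
    tuples, plus the Data Object at which the walk ended.
    """
    flows = []
    current = start
    visited = {start}
    while current not in golden_records:
        following = feeds.get(current)
        if following is None or following in visited:
            break  # dead end or cycle: no Golden Record reachable this way
        # The flow between the two Systems is labelled with the Data Object that moves.
        flows.append((managed_in[current], managed_in[following], current))
        visited.add(following)
        current = following
    return flows, current


flows, end = data_movement(MANAGED_IN, FEEDS, GOLDEN_RECORDS, "Shop Customer")
for from_system, to_system, label in flows:
    print(f"{from_system} --[{label}]--> {to_system}")
print(f"Chain ends at: {end}")
```

If a Data Object can feed several others, the single `feeds` lookup would have to become a list and the loop a breadth-first or depth-first search.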
The end result could be something along these lines:
Part of what we covered so far (e.g., keeping track of the lineage of data, stewardship, etc.) is considered to be metadata. This is a broad topic in itself and we will give a brief introduction in the next post in this series. If you have any questions or suggestions, please drop us a note!
1) In practice these rules are pertinent to the design of so-called ETL processes. ETL stands for “Extraction (of data from a source), Transformation (to the target format) and Load (in the target system)”.