Monday, September 22, 2008

Master Data Management (CDI/IR/PIM)

What is Master Data Management

It's the process of managing "Master Data". 

Master Data Management is a technology-enabled discipline in which business and Information Technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise's official shared master data assets.

What is Master Data

Enterprise data can be broadly categorized as follows:

Transactional data are the elements that support the on-going operations of an organization and are included in the application systems that automate key business processes. This can include areas such as sales, service, order management, manufacturing, purchasing, billing, accounts receivable, and accounts payable. Commonly, transactional data refers to the data that is created and updated within the operational systems. Examples of transactional data include the time, place, price, discount, and payment method used at the point of sale.

Analytical data are the numerical values, metrics, and measurements that provide business intelligence and support organizational decision making. Typically analytical data is stored in Online Analytical Processing (OLAP) repositories optimized for decision support, such as enterprise data warehouses and department data marts. Analytical data is characterized as being the facts and numerical values in a dimensional model. Normally, the data resides in fact tables surrounded by key dimensions such as customer, product, account, location, and date/time. 


Master data is your most important data: it plays a critical role in the core operation of a business. Master data refers to the prime entities that are used by several functional groups in your enterprise and are typically stored in different data systems across an organization. 
In other words, it represents the business entities around which the organization’s business transactions are executed and the primary elements around which analytics are conducted. 
Master data is typically persistent, non-transactional data utilized by multiple systems that define the primary business entities. 
Master data may include data about customers, products, employees, inventory, suppliers, and sites.
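
The relationship between the three categories can be pictured with a small sketch. This is only an illustration under assumed names: the `Customer` and `Sale` classes, their fields, and the sample records are all hypothetical. Master records are the shared entities, transactions reference them by key, and the analytical view aggregates transaction facts along a master-data dimension.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Master data: a persistent business entity shared across systems."""
    customer_id: str
    name: str
    city: str


@dataclass(frozen=True)
class Sale:
    """Transactional data: one event that references master data by key."""
    sale_id: str
    customer_id: str
    amount: float


# Hypothetical sample records, invented for illustration only.
customers = {
    "C1": Customer("C1", "Asha Rao", "Kolkata"),
    "C2": Customer("C2", "John Smith", "Austin"),
}
sales = [
    Sale("S1", "C1", 120.0),
    Sale("S2", "C2", 80.0),
    Sale("S3", "C1", 30.0),
]


def revenue_by_city(sales, customers):
    """Analytical view: sum the sale amounts along the customer's city dimension."""
    totals = defaultdict(float)
    for sale in sales:
        totals[customers[sale.customer_id].city] += sale.amount
    return dict(totals)


totals = revenue_by_city(sales, customers)
```

If the customer's city is wrong in the master record, every transaction and every report that depends on it inherits the error, which is why master data deserves its own management discipline.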
 

This year we have seen emerging trends in CDI and PIM. Let's look at what they are, why enterprises need them, why they are so sought after, and how they fit into the MDM ecosphere.

CDI/IR/MDM

Customer data integration (CDI) and master data management (MDM) are getting significant buzz in both information technology (IT) and business circles. 
Master data management represents the data management disciplines and processes, and 
CDI represents the authoritative system for customer data. 
Both concepts are tightly linked.

It all started with CDI, and CDI started with the CIF, or Client Information File, in which the information for a customer is stored in a digitized file. This information can be accessed by various applications across the enterprise. However, different applications often store the client information in their own way and format, which is neither accessible to nor sharable with other applications. Hence customer data integration is required to share the information across multiple applications.
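
As a minimal sketch of that integration step, assume two hypothetical applications (a billing system and a CRM) that store the same customer under different field names. The field names and the shared layout below are invented for illustration; a real CDI hub would define its own model.

```python
def from_billing(record):
    """Map a hypothetical billing-system record to the shared layout.

    Assumes the billing system stores the full name in one field,
    with the first and last name separated by a space.
    """
    first, last = record["cust_name"].split(" ", 1)
    return {
        "first_name": first,
        "last_name": last,
        "email": record["email_addr"].strip().lower(),
    }


def from_crm(record):
    """Map a hypothetical CRM record to the same shared layout."""
    return {
        "first_name": record["FirstName"],
        "last_name": record["LastName"],
        "email": record["Email"].strip().lower(),
    }


# The same customer as two applications might store it (sample data).
billing_record = {"cust_name": "Asha Rao", "email_addr": "Asha.Rao@Example.com"}
crm_record = {"FirstName": "Asha", "LastName": "Rao", "Email": "asha.rao@example.com"}

shared_from_billing = from_billing(billing_record)
shared_from_crm = from_crm(crm_record)
```

Once both sources are expressed in one layout, the records can be compared, merged, and shared with other applications.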

Identity resolution (IR) is an operational intelligence process, typically powered by an entity resolution engine or middleware, whereby organizations can connect disparate data sources with a view to understanding possible entity matches and non-obvious relationships across multiple data silos. It analyzes all of the information relating to individuals and/or entities from multiple sources of data and then applies likelihood and probability scoring to determine which identities are a match and what, if any, non-obvious relationships exist between those identities.
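
The scoring idea can be sketched in a few lines. This is a simplified illustration, not how any particular entity resolution engine works: it swaps the probabilistic scoring described above for a weighted string-similarity score built on Python's standard `difflib.SequenceMatcher`. The field names, weights, and threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Assumed field weights and threshold; a real engine would tune or learn these.
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}
MATCH_THRESHOLD = 0.85


def similarity(a, b):
    """Return a 0..1 similarity ratio, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a, rec_b):
    """Combine the per-field similarities into one weighted score."""
    return sum(
        weight * similarity(rec_a[field], rec_b[field])
        for field, weight in WEIGHTS.items()
    )


def is_match(rec_a, rec_b):
    """Treat two records as the same identity when the score is high enough."""
    return match_score(rec_a, rec_b) >= MATCH_THRESHOLD


# Hypothetical records from two data silos.
silo_a = {"name": "Asha Rao", "email": "asha.rao@example.com", "city": "Kolkata"}
silo_b = {"name": " asha rao", "email": "Asha.Rao@Example.com", "city": "KOLKATA "}
silo_c = {"name": "Wei Zhang", "email": "wz@corp.org", "city": "Berlin"}
```

Here `is_match(silo_a, silo_b)` holds because the two records differ only in case and whitespace, while `silo_c` falls well below the threshold. Records that land in between are the ones a data steward would typically review by hand.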

PIM/MDM

Product information management (PIM) is the process of managing all the information required to market and sell products through distribution channels. This product data is created by an internal organization to support a multichannel marketing strategy. A central hub of product data can be used to distribute information to sales channels such as e-commerce websites, print catalogs, marketplaces such as Amazon and Google Shopping, social media platforms like Facebook and electronic data feeds to trading partners.

PIM solutions are most relevant to business-to-consumer (B2C) and business-to-business (B2B) firms that sell products through a variety of sales channels in a range of industries.
The use of PIM is generally driven by factors such as:
  • A wide array of products and/or complex product data set
  • Frequently changing product characteristics
  • An increasing number of sales channels
  • Non-uniform IT infrastructure 
  • Online business and electronic ordering
  • Various locales and localization requirements
PIM systems manage customer-facing product data needed to support multiple geographic locations, multi-lingual data, and maintenance and modification of product information within a centralized catalog. Product information kept by a business can be scattered throughout departments and held by employees or systems, instead of being available centrally; data may be saved in various formats, or only be available in hard copy form. Information may be needed for detailed product descriptions with prices, or for calculating freight costs. PIM represents a solution for centralized, media-independent product data maintenance, as well as efficient data collection, enrichment, data governance, and output.

Key concepts of MDM

Let's revisit the basics of Master Data Management & Data Governance! 

MDM Eco-System

The MDM Eco-System, as I like to call it, mainly revolves around processes like Data Quality, Data Cleansing, De-duplication and Data Stewardship.

Data Quality

There are 7 dimensions of Data Quality:
  1. Accuracy: The degree of conformity of a measure to a standard or a true value. Accuracy is very hard to achieve through data cleansing in the general case because it requires accessing an external source of data that contains the true value: such "gold standard" data is often unavailable. Accuracy has been achieved in some cleansing contexts, notably customer contact data, by using external databases that match up zip codes to geographical locations (city and state) and also help verify that street addresses within these zip codes actually exist.
  2. Completeness: The degree to which all required measures are known. Incompleteness is almost impossible to fix with data cleansing methodology: one cannot infer facts that were not captured when the data in question was initially recorded. (In some contexts, e.g., interview data, it may be possible to fix incompleteness by going back to the original source of data, i.e. re-interviewing the subject, but even this does not guarantee success because of problems of recall - e.g., in an interview to gather data on food consumption, no one is likely to remember exactly what they ate six months ago. In the case of systems that insist certain columns should not be empty, one may work around the problem by designating a value that indicates "unknown" or "missing", but the supplying of default values does not imply that the data has been made complete.)
  3. Consistency: The degree to which a set of measures are equivalent across systems. Inconsistency occurs when two data items in the data set contradict each other: e.g., a customer is recorded in two different systems as having two different current addresses, and only one of them can be correct. Fixing inconsistency is not always possible: it requires a variety of strategies - e.g., deciding which data were recorded more recently, which data source is likely to be most reliable (the latter knowledge may be specific to a given organization), or simply trying to find the truth by testing both data items (e.g., calling up the customer).
  4. Conformity: This can be ensured by enabling validation constraints, which fall into the following categories:
    • Data-Type Constraints: Values in a particular column must be of a particular data type, e.g., Boolean, numeric (integer or real), date, etc.
    • Range Constraints: typically, numbers or dates should fall within a certain range. That is, they have minimum and/or maximum permissible values.
    • Mandatory Constraints: Certain columns cannot be empty.
    • Unique Constraints: A field, or a combination of fields, must be unique across a dataset. For example, no two persons can have the same social security number.
    • Set-Membership constraints: The values for a column come from a set of discrete values or codes. For example, a person's gender may be Female, Male or Unknown (not recorded).
    • Foreign-key constraints: This is the more general case of set membership. The set of values in a column is defined in a column of another table that contains unique values. For example, in a US taxpayer database, the "state" column is required to belong to one of the US's defined states or territories: the set of permissible states/territories is recorded in a separate State table. The term foreign key is borrowed from relational database terminology.
    • Regular expression patterns: Occasionally, text fields will have to be validated this way. For example, phone numbers may be required to have the pattern (999) 999-9999.
    • Cross-field validation: Certain conditions that utilize multiple fields must hold. For example, in laboratory medicine, the sum of the components of the differential white blood cell count must be equal to 100 (since they are all percentages). In a hospital database, a patient's date of discharge from the hospital cannot be earlier than the date of admission.
  5. Concurrency
  6. Duplication
  7. Integrity: The term integrity encompasses accuracy, consistency and some aspects of validation.
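
Several of the conformity constraints listed above can be sketched as plain checks. The record layout, field names, and the age range are assumptions made up for this example, not a standard schema; the phone pattern follows the (999) 999-9999 format mentioned above.

```python
import re
from datetime import date

PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")
ALLOWED_GENDERS = {"Female", "Male", "Unknown"}


def validate_patient(record):
    """Return the list of constraint violations for one hypothetical patient record."""
    errors = []
    # Mandatory constraint: the name column cannot be empty.
    if not record.get("name"):
        errors.append("name is mandatory")
    # Data-type and range constraints: age is an integer within assumed bounds.
    age = record.get("age")
    if not isinstance(age, int) or not 0 <= age <= 130:
        errors.append("age must be an integer between 0 and 130")
    # Set-membership constraint: gender comes from a fixed set of codes.
    if record.get("gender") not in ALLOWED_GENDERS:
        errors.append("gender must be Female, Male or Unknown")
    # Regular expression pattern: phone must look like (999) 999-9999.
    if not PHONE_PATTERN.fullmatch(record.get("phone") or ""):
        errors.append("phone must match (999) 999-9999")
    # Cross-field validation: discharge cannot be earlier than admission.
    admitted, discharged = record.get("admitted"), record.get("discharged")
    if admitted and discharged and discharged < admitted:
        errors.append("discharge date is earlier than admission date")
    return errors


def duplicate_values(records, key):
    """Unique constraint: return the values of `key` that occur more than once."""
    seen, duplicates = set(), set()
    for record in records:
        value = record[key]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates


good = {"name": "Asha Rao", "age": 41, "gender": "Female", "phone": "(555) 123-4567",
        "admitted": date(2008, 9, 1), "discharged": date(2008, 9, 5)}
bad = {"name": "", "age": 300, "gender": "N/A", "phone": "555-1234",
       "admitted": date(2008, 9, 5), "discharged": date(2008, 9, 1)}
```

Calling `validate_patient(good)` returns an empty list, while `validate_patient(bad)` reports one violation per rule. In a relational database, most of these rules would normally be declared as column types, NOT NULL, CHECK, UNIQUE, and FOREIGN KEY constraints rather than coded by hand.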

Data Cleansing

Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records in a record set, table, or database. It refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. (Source: Wikipedia)
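
A minimal sketch of what such a cleansing step might look like for a contact record follows. The field names and rules are assumptions for illustration: whitespace and casing are normalized, the phone number is reformatted when it has exactly ten digits, and records that cannot be repaired are rejected.

```python
import re


def clean_record(record):
    """Return a cleaned copy of a contact record, or None if it cannot be repaired."""
    # Modify: collapse repeated whitespace and normalize casing.
    name = " ".join((record.get("name") or "").split()).title()
    email = (record.get("email") or "").strip().lower()
    # Keep only the digits of the phone number.
    digits = re.sub(r"\D", "", record.get("phone") or "")

    # Delete: a record without a name or a plausible email is rejected.
    if not name or "@" not in email:
        return None

    # Replace: reformat ten-digit numbers, otherwise mark the phone as unknown.
    phone = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}" if len(digits) == 10 else None
    return {"name": name, "email": email, "phone": phone}


dirty = {"name": "  asha   RAO ", "email": " Asha@Example.COM ", "phone": "555.123.4567"}
cleaned = clean_record(dirty)
rejected = clean_record({"name": "", "email": "not-an-email", "phone": ""})
```

Note that setting the phone to `None` only records that the value is unknown; as discussed under Completeness, it does not make the data complete.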


Data De-Duplication


Data Survivorship


Data Stewardship


Data Lineage


-
Kinshuk Dutta
Kolkata


