…of Structure and Content.

February 1, 2012 by

Ran across this interesting article posted in the MIS Class Blog by Antonio Montanez. The MIS Class Blog was created by Sonya Zhang and her students at Cal State Fresno and Cal Poly Pomona.

First, it is encouraging to have students focused on Data Management.  Thanks to Antonio for writing the article and reminding us of Malcolm Chisholm’s work (he recently published another book, ‘Definitions in Information Management’, that can be found on Amazon).  Data Management is an underserved field that can always use fresh talent.  Sure, there are tool fads and vendor hype, but there are far too few real practitioners of data content management.

When you talk about reference data you quickly encounter the difference between STRUCTURE (the data model and even physical tables) and CONTENT (the values in the columns).

Antonio reminds us accurately, from Malcolm’s article, that reference data is not visible in the data model.  We cannot stress enough the danger of relying solely on the data model entity relationship diagram (ERD) for understanding.  In fact, the ERD is extremely restrictive in the facts it presents.  It is like looking at a building’s architecture diagram and expecting to ‘know’ what the building will look like and contain when completed.  Another analogy is that the data model is like someone’s shadow.  You get the basic characteristics of a person but could not know their eye color or their smile.

Take currency (CCY), for instance.  On most data models you would just see the attribute CCY of type text and size 3 (if it is ISO 4217).  Rather innocuous.  It may be far down the list and seem relatively unimportant.  However, in a financial data model, you will usually find it links back to detail transactions.  If it is missed in development, its importance will become apparent at some point in testing.  In fact, it will likely be part of what determines the uniqueness of a data row.

Reference Data Content strongly introduces the issue of Data Governance and Quality.  How will CCY be populated and maintained?  Given there is an ISO Standard is that what will be utilized?  Are conversions or cleansing necessary upon the source data?  
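As a minimal sketch of the kind of content check these questions imply, the Python below compares incoming currency values against a small reference set after light cleansing. The three-code set, the function name and the sample rows are assumptions made up for illustration; a real implementation would load the full ISO 4217 list from a governed source.

```python
# Hypothetical reference set; a real one would hold the full ISO 4217 list.
VALID_CCY = {"USD", "EUR", "JPY"}

def check_currency(rows, valid=VALID_CCY):
    """Split rows into (clean, rejected) based on their 'ccy' value."""
    clean, rejected = [], []
    for row in rows:
        # Light cleansing: trim whitespace and upper-case before comparing.
        code = (row.get("ccy") or "").strip().upper()
        if code in valid:
            clean.append({**row, "ccy": code})
        else:
            rejected.append(row)
    return clean, rejected

# Made-up sample rows: one needs cleansing, two cannot be matched at all.
sample = [
    {"id": 1, "ccy": "usd "},
    {"id": 2, "ccy": "EUR"},
    {"id": 3, "ccy": "US$"},
    {"id": 4, "ccy": None},
]
clean, rejected = check_currency(sample)
```

Rejected rows are kept rather than dropped so that a data steward can decide whether they need a conversion rule or a correction at the source.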

I once had a mentor and, despite this one habit, am really appreciative of what he shared.  His bad habit was that he would design a logical model, without looking at the data in any detail, and proclaim it was “90% done”.  “You can finish off the last 10%” he would say.  Quite often, much of the “10%” was data content analysis and reference data.  Building a data model (structure) without looking at the reference data values (content) is to data management what an architectural drawing is to a finished and occupied building.  An important start but by no means 90% complete.

Thanks, Antonio, for the Reference Data Focus.  And to those looking for a long career: consider opportunities in data management.  Every day information becomes larger and more complex.  There are vast areas of unstructured data that we have not yet begun to tame.  Large data and virtualization, to name just two areas, will result in advances in data management.


Data Quality: Why is the water brown?

April 20, 2010 by

At one point in life I lived in a house with a natural spring well for water supply.  In the spring when the rain was insistent the water coming out of the faucet would turn brown.  Needless to say, we didn’t drink the water. 

After a time we installed a filtration system and the water was clear.  We still didn’t drink the water, though there was the impression that because the output was clear it was pure.  Interestingly enough, since we had installed a filter (but not complete purification), we inadvertently added a maintenance step to a formerly maintenance-free system.  Now we had the appearance of ‘pureness’ along with added cost, without actually benefiting from a ‘pure’ water supply.  Does that sound like some of the data issues you grapple with?

What is the point of this in an MDM Blog post?  Well, the Enterprise Data Ecology works much the same way.  Somehow data got into the system, whether by users, automation (capture) or external feeds.  Is it ‘pure’?  Just because it seems ‘clear’ doesn’t mean it’s pure.  What steps are in place to filter, purify and test?  If your company’s data was water coming out of a faucet, would you drink it?

In the past I’ve helped firms evaluate their data quality, processes and governance, and have recommended and implemented improvements.  Several of these engagements involved external data.  At one point I was approached about developing a repeatable quality audit solution for a specific data provider.  Though this was never funded, I have helped several firms ‘reassess’ their SLA/Contracts with various providers, helping them improve overall data quality while reducing external data costs (through reduced fees corresponding to quality benchmarks).

Today I noticed a good article for those incorporating external data, “External Data in Enterprise Data Warehousing – Trendy and Trying” on information-management.com.  Mr. Ramakrishnan does a good job of pointing out some of the major challenges and issues when bringing external data into your organization.  He also provides practical advice for incorporating external data.  NOTE:  I have no relationship with Mr. Ramakrishnan and submit this as my (and mine only) objective opinion.

I hope this article and my plumbing analogy will help next time you are considering bringing in external data.  Note that the point of this post is not to discourage external data integration.  Indeed, these days a closed off enterprise is not realistic in most cases.  However, remember that the cost isn’t simply the cost of the data feed.  There is an added maintenance cost of increased complexity and management.  In addition, think of the potential cost of ‘organizational sickness’ if dirty data impacts revenues, costs, risk or compliance. 

On the other hand, weigh the potential ‘lost opportunity cost’ if your organization remains in its pond of denial.  I know, that is a weak pun and I’ll stop now while I’m (hopefully) ahead.

Data SDLC for non-dummys!

November 25, 2009 by

In software development there has long been the understanding of a System Development Life Cycle.  The software SDLC phases consist of Initiation, System Concept Development, Planning, Requirements Analysis, Design, Development, Integration and Test, Implementation, Operations and Maintenance, and Disposition (Wikipedia Reference).

For the purposes of this discussion (because it is common, not because it’s correct) we will focus only on DEV, TEST, UAT and implementation to PRODuction. These are the phases where data is most often considered an important factor. In fact, more focus earlier (and later) in the process would save much enterprise time and cost.

Corresponding to this rather standard approach to software development, code is versioned, archived and tested via various means before UAT and promotion into a production environment.

When working with Enterprise Data Management (EDM) systems there is a similar but often underserved need. Often data is treated like the thirteenth ugly stepchild. Common quotes to developers are: “Just use some test data”, “Make up your own data sets” and “A sample set is good enough”.

Sometimes project managers and major stakeholders maintain that there is not enough time to get ‘real’ data, or that security restrictions forbid it and there is no time to ‘blind’ the data.

Blinded data, for those privileged few that aren’t familiar with the term, is data with its identifiable features removed. Sometimes, blinded data is worse than created test data. Where business experts are concerned, when Customer A accounts for 80% of the revenue, they either quickly realize who Customer A really is OR the data loses all relevance (often due to associated blinded amounts).

Security needs must be considered but also ask, what is the requirement and associated risk compared to a flawed system? Also, there are tools that will help where ‘blinding’ is necessary.
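Where blinding is required, one common technique (an assumption here, not necessarily what any particular tool does) is to replace identifying values with a keyed hash, so that the same customer always maps to the same pseudonym. The sketch below uses Python’s standard `hmac` and `hashlib` modules; the key, the `CUST_` prefix and the sample rows are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-me"  # hypothetical key; keep the real one out of source control

def blind(value, key=SECRET_KEY):
    """Return a stable pseudonym for an identifying value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "CUST_" + digest[:8]

# Made-up rows; only the identifying column is blinded, amounts are untouched.
rows = [
    {"customer": "Acme Corp", "amount": 800},
    {"customer": "Acme Corp", "amount": 50},
    {"customer": "Bolt Ltd", "amount": 150},
]
blinded = [{**r, "customer": blind(r["customer"])} for r in rows]
```

Because the amounts are left intact, a business expert could still guess who the largest customer is, which is exactly the trade-off described above. Blinding the amounts as well would hide that, at the cost of relevance.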

The lack of realistic and consistent data early in the development system wreaks havoc on a developer’s logic.  Needless development effort is consumed to ‘code around’ data errors that were introduced in the test data creation and have nothing to do with the reality of data that will be loaded once in production. On the other hand, logic that should be implemented based on actual data is never considered due to the blind logic (pun intended) of test data.

Once the system is developed and it is almost time to test there is often a scramble to fix application and data ‘bugs’. As the data is modified to pass tests it becomes more consistent with the software code (see diagram Data SDLC; DEV phase), however, is it really more ‘correct’?

Through the test phase the data continues to become more consistent with developed code (see diagram Data SDLC; TEST phase). At the point where the users sign off, it is at its most consistent with the software code. Now the data and code function well enough together to provide confidence for User Stakeholders to sign off on the production move.

Data Reliability SDLC


Hopefully, at this point the data is consistent, or better yet, production quality. If it is then the signoff is much more legitimate. Preferably this is in parallel to another system if it is a systems replacement project. However, if functionality has significantly changed data comparison will become difficult if not impossible. This is a reconciliation discussion for a later time.

The glorious day has arrived and you have migrated to production (let’s address the actual migration in more detail at a later time). If you haven’t had a period of parallel running, and maybe even if you have, the sad fact is that it’s downhill from here, my friend…and not in a good way. Joking aside, if real, timely production data hasn’t hit the system prior to production migration, hold on to your hat. You could be in for a bumpy ride.

Once the system reaches a level of stability you will find the data reliability settles into an ‘Operational’ mode (see diagram Operational Data). Often, we find that the data reliability is at its worst in the few minutes or hours after it’s loaded (that’s when we find the most serious issues) or after some period of time when it becomes out of date, increasingly updated, or otherwise modified beyond its original intention.

Data Reliability Operational


At the top of the data reliability curve are those beautiful days or weeks or months (gasp!) of full nights’ sleep, when all the known issues have been fixed and time has not yet degraded the data reliability. These are the times for which we live.

Beware DML destruction! DML or Data Manipulation Language is often used to make those quick ‘fixes’ to insert, delete or update values, often without sufficient testing, audit or logs. DBAs will often, with just cause, protest these migrations with gusto. They know it could mean their next night’s sleep.
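To make the contrast with an unlogged quick fix concrete, here is a minimal sketch of an audited update using Python’s built-in `sqlite3` module. The `trade` and `audit_log` tables, the `fix_ccy` function and the sample values are all hypothetical; the point is only that the old value, the new value and the reason are recorded in the same transaction as the change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trade (id INTEGER PRIMARY KEY, ccy TEXT)")
conn.execute(
    "CREATE TABLE audit_log (trade_id INTEGER, old_ccy TEXT, new_ccy TEXT, reason TEXT)"
)
conn.execute("INSERT INTO trade (id, ccy) VALUES (1, 'US$')")
conn.commit()

def fix_ccy(conn, trade_id, new_ccy, reason):
    """Update one row and record the old value, or change nothing at all."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT ccy FROM trade WHERE id = ?", (trade_id,)
        ).fetchone()
        if row is None:
            raise ValueError(f"trade {trade_id} not found")
        conn.execute(
            "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
            (trade_id, row[0], new_ccy, reason),
        )
        conn.execute("UPDATE trade SET ccy = ? WHERE id = ?", (new_ccy, trade_id))

fix_ccy(conn, 1, "USD", "normalize to ISO 4217")
```

A production database would normally add who made the change and when, and would go through whatever change control the DBAs require. The sketch only shows the shape of the idea.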

You will notice in the “Operations Diagram” a SET A and a SET B. These are representative, but you will often find after some time in production the data ‘acts’ a certain way. It has characteristics of data reliability, sometimes trending over time more steeply (SET A) or more linearly (SET B).

I’m convinced the data has emotions. Don’t rile it up. It might not be stable! Seriously, given the good news of Operational and SDLC Data should we simply find another line of work? In fact, once you get used to ‘your’ data, you’ll begin to notice these patterns. You will be able to just know from a report or load immediately whether or not the data is off. It’s like a sixth sense of data.  

I recall an ‘operational data god’ who would get a notification and, just from the file byte size contained in the subject, knew whether it was going to be a good day.  He never had to open up the message.

But wouldn’t it be better if we could use tools to reduce that burn-in/learning cycle? Wouldn’t it be nice to know the issues before the customer reported them?  Wouldn’t it be wonderful for test data and scripts to be like production and vice versa?  There are tools and techniques that will help but that is a topic for later discussion.
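As one small illustration of automating that sixth sense, the sketch below flags a load file whose byte size is far from what past loads looked like, echoing the file size anecdote above. The function name, the three-standard-deviation threshold and the sample sizes are assumptions chosen for illustration, not a recommendation.

```python
from statistics import mean, stdev

def size_looks_off(history, new_size, tolerance=3.0):
    """Flag new_size if it is more than `tolerance` standard deviations
    away from the mean of the historical sizes."""
    if len(history) < 2:
        return False  # not enough history to judge
    avg, spread = mean(history), stdev(history)
    if spread == 0:
        # Every past file was the same size, so any difference is suspicious.
        return new_size != avg
    return abs(new_size - avg) / spread > tolerance

# Made-up byte sizes of previous daily loads.
history = [10_200, 10_050, 9_980, 10_120, 10_010]
```

With this history, a 10,100 byte file passes quietly while a 4,000 byte file is flagged before anyone opens a report.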

Poll: Who is your Data Steward?

June 17, 2009 by

Interview with Paul Billingham of MDM Vendor Orchestra Networks

June 8, 2009 by

I had the opportunity to remote interview Paul Billingham of Orchestra Networks last week.

Orchestra Networks EBX.Platform provides a Model Driven approach to MDM/Data Governance. Among their features is full versioning at the attribute level, security and auditing, hierarchy management and of course a business oriented focus for MDM and Data Governance via their “Model Driven” approach.

Orchestra is what is considered a ‘Generalist MDM Solution’.  To provide a little background, many of the current products on the market are Specialist or Niche products originating from CDI (Customer Data Integration) or PIM (Product Information Management) and hence specialize in solving problems in their corresponding problem space.  Orchestra’s EBX.Platform is designed to provide a more multi-domain or generic solution.  Here is a post by Andrew White that provides more insight.

Here is a rough-edited interview that had to be aggressively cut to meet YouTube time constraints. Hopefully, you’ll get the idea:

Orchestra’s North American System Integration/Consulting Partners are Business and Decision and Sense Corp.  Their technology partners include Informatica and Software AG’s Webmethods.  In addition, they are alliance members of MDM Alliance Group.

Orchestra Networks was included in Gartner’s list of “Cool Vendors in Master Data Management”, 2008 by Andrew White and John Radcliffe. This research highlights cool vendors and technologies that address aspects of the master data management (MDM) market or the needs of MDM project leaders.

I haven’t used this software but embrace the model driven approach as a way to get business engaged with technology creating an MDM approach where the requirements gathering becomes the working deliverable.  In addition, facilitating rapid MDM development, identification of appropriate Data Stewards, workflow creation and utilizing versioning and role based security seems a leap-frog approach.

Data Governance Info and Links

April 5, 2009 by

Spend more than a few minutes in the MDM, Reference Data, Data Management world and you’ll bump (or get slammed) into the need for Data Governance. In fact, managing data from a technical standpoint without its corresponding discipline of governance is of little consequence. An enterprise can’t “fix IT” or “manage IT” if there are no controls.

In the process of researching, educating or strategizing I have run across this resource by Gwen Thomas that I’ve referred to often and in this post wish to give credit and share with you in case it’s of use:

Also, the glossary is excellent for covering terms in our specific area of interest.

This is an objective recommendation, as I don’t know and haven’t met Gwen personally (yet), so it is based on my personal use and the fit of the content provided.

It fits with my core principle that I.T. should provide the tools to allow “Business” to manage enterprise information. As a cross-functional team the entire organization’s data management can be improved.

MDM and the Information Architecture

February 16, 2009 by

Some interesting discussions have come up recently regarding MDM and ERP as well as MDM and EIM (Enterprise Information Management).

Here is a definition from a recent article by Andrew White that discusses MDM’s place in EIM:

…”Enterprise Information Management (EIM) is a business oriented information strategy that is adopted when a firm decides to manage information as an asset for reuse.”

In short, if MDM is considered ONLY a cost center…an I.T. initiative intended to lower costs and sponsored by I.T. Leadership…can it survive a Business/Customer focused change of initiatives?  On the other hand, if MDM is *known by* and actively supported by business as an integral part of an Information Architecture AND a potential revenue generator it has a much greater chance of continued support and success. 

Additionally, there is an article on ERP vs. MDM (ERP and MDM) that discusses the separate data models (and requirements) for:

  • ERP data model used as MDM
  • MDM as separate data model
  • Business intelligence model

In conclusion, for all of the above, there is no *right* answer that does not include the specific context and requirements of the organization involved.   As usual, the sooner one can understand the requirements (the what) the sooner a realistic information architecture (the how) can be determined.

MDM glass half full or half empty?

February 12, 2009 by


In a recent article by Thomas Wailgum titled “Data Management Danger: Less than half of MDM Plans are effective” he discusses a survey (123 business and IT leaders at companies with annual revenues of $500 million or more) conducted by IDG Research Services, which was sponsored by vendor Kalido.

Some stats from the article:

  • 47% rated their organization’s data-management efforts as “effective or very effective.”
  • less than 33% of businesses have taken steps to remedy the situation with a data-governance program
  • 13% of respondents said they were unclear as to what data governance was

Which department should be responsible for maintaining the accuracy of enterprise data?

  • 31% of respondents said the role is held by the IT department
  • 25% a cross-functional team
  • 10% finance
  • 6% manufacturing
  • 6% sales
  • 3% marketing
  • 10% nobody is in charge

Saving money and increasing revenues with data-management policies:

  • an average of $38 million in cost savings or revenue increase from current or planned initiatives reported
  • 77% of respondents expect savings to occur on an annual basis
  • 45% expect an annual increase in revenue.

Business initiatives include:

  • 63% regulatory/statutory reporting
  • 45% cost-reduction activities
  • 35% cross-sell/up-sell campaigns

Clearly there is opportunity for further education and improvement.  One of the primary issues I’ve encountered is companies attempting to solve the Master/Reference Data Issue by throwing tools at it.  Technology can help produce a more consistent, scalable and organized approach but it must also be accompanied by high level Stakeholder Support (in Governance and $$$), Data Stewardship (including business and technology teams) and a reasonable plan for implementation.

In short, I.T. needs to provide the tools but only a combined team of business and technology can solve the enterprise wide issue of ‘managed’ data.

Technology used on MDM Projects

February 8, 2009 by

As an Architect/Technical PM I’ve used quite a few tools in implementing MDM, Data Warehouse and BI solutions.  For the first post I will list tools I’ve used on various MDM Projects.  Later posts will start to include more details.   I encourage others to participate by posting comments regarding tools you’ve used on MDM related projects:

Informatica – ETL, Data Quality and Metadata Manager.  Who hasn’t been on a project where this tool was used?  It has come a long way over the years, adding performance and functionality.

Siperian – MDM Hub – Once known as a CDI (Customer Data Integration) tool, it has become one of the best-known MDM tools.

Teradata – I’ve done more ‘working around’ this product, sourcing from and delivering data to implementations.  However, it is one of the most established ‘enterprise scale’ data warehousing solutions.

GlobalIDs – I have known the founder for several years seeing GlobalIDs grow from a small suite to the most complete (that I know of) MDM suite built (not bought and assimilated) on an ‘agent processing’ architecture.

Netezza – a ‘Database Appliance’ – My first introduction was at a client where the refrigerator-sized appliance was delivered about 9AM.  It was set up and data imported by noon.  Around 1PM we pointed the existing Microstrategy Reporting application to the new DB and were running reports by 2PM.  Not your granddaddy’s environment install, as they say.  This was the fastest performing of any database I’ve ever used.  ‘Like’ queries (tablespace scans in common DBs) were returning in seconds against 40 million row tables.  It made development/data analysis/profiling a joy.  Saved a bunch on development labor costs and project timelines.

There are others related to MDM that I’ve been exposed to for POCs (Proofs of Concept), evaluation and architecture review.  I will add more later but a few of note:

QlikView – This is an in-memory DB BI tool.  Used to eliminate ETL and provide a flexible and powerful front-end.

Talend – This is an Open Source “Data Integration” tool; they have also added a Data Quality component.

Pentaho – I knew them from ‘Kettle’.  They have grown to call themselves an “Open Source BI Suite” that includes ETL, Reporting, OLAP and Data Mining.

Teleran – This is not directly an MDM tool.  Teleran iSight allows you to see all interaction with the database (Users, Queries, Rows returned, time, etc).  I have used it to look at ‘real usage’ of a Data Warehouse/Mart as opposed to what Business Analysts were ‘told’ was being used.  It is also good for helping DBAs performance tune from the Application standpoint.  There is also an iGuard product to control access to database environments by user and application (as in excluding M$ Access, Toad from production ad hoc queries).

These are a few not to exclude MS SQL Server, Oracle, IBM DB2, mySQL, Postgres, etc…

MDM Tools

February 8, 2009 by

This category will hold information, news and commentary about technology used in MDM