In software development there has long been an understanding of a System Development Life Cycle (SDLC). The SDLC phases consist of Initiation, System Concept Development, Planning, Requirements Analysis, Design, Development, Integration and Test, Implementation, Operations and Maintenance, and Disposition (Wikipedia Reference).
For the purposes of this discussion (because it is common, not because it’s correct) we will focus only on DEV, TEST, UAT and implementation to PRODuction. These are the phases where data is most often treated as an important factor. In fact, more focus on data earlier (and later) in the process would save the enterprise much time and cost.
Corresponding to this rather standard approach to software development, code is versioned, archived and tested via various means before UAT and promotion into a production environment.
When working with Enterprise Data Management (EDM) systems there is a similar but often underserved need. Often data is treated like the thirteenth ugly stepchild. Common quotes to developers are: “Just use some test data”, “Make up your own data sets”, “A sample set is good enough”.
Sometimes project managers and major stakeholders maintain that there is not enough time to get ‘real’ data, or that security restrictions forbid it and there is no time to ‘blind’ the data.
Blinded data, for those privileged few who aren’t familiar with the term, is data with its identifiable features removed. Sometimes blinded data is worse than created test data. Where business experts are concerned, when Customer A accounts for 80% of the revenue, they either quickly realize who Customer A really is or the data loses all relevance (often due to the associated blinded amounts).
Security needs must be considered, but also ask: what is the requirement, and how does its associated risk compare to the risk of a flawed system? Also, there are tools that will help where ‘blinding’ is necessary.
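To make ‘blinding’ concrete, here is a minimal sketch of one common approach: replacing an identifying field with a keyed hash so the same customer always maps to the same token. The field names, the `SECRET_KEY` value and the `blind` and `blind_rows` helpers are all hypothetical, and this is an illustration rather than a recommendation of any particular tool. Note that it leaves the amounts untouched, so it does nothing about the Customer A problem described above.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside the code base.
SECRET_KEY = b"replace-with-a-real-secret"


def blind(value: str, prefix: str = "CUST") -> str:
    """Return a stable token for an identifying value using a keyed hash."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:10]}"


def blind_rows(rows, id_field="customer"):
    """Copy each row, replacing only the identifying field."""
    return [{**row, id_field: blind(row[id_field])} for row in rows]


rows = [
    {"customer": "Acme Corp", "revenue": 800},
    {"customer": "Bolt Ltd", "revenue": 200},
    {"customer": "Acme Corp", "revenue": 50},
]
blinded = blind_rows(rows)
```

Because the token is stable, joins and group-bys on the blinded field still behave the way they would on the real data, which is much of what a developer needs from it.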
The lack of realistic and consistent data early in the development system wreaks havoc on a developer’s logic. Needless development effort is consumed to ‘code around’ data errors that were introduced in the test data creation and have nothing to do with the reality of data that will be loaded once in production. On the other hand, logic that should be implemented based on actual data is never considered due to the blind logic (pun intended) of test data.
Once the system is developed and it is almost time to test, there is often a scramble to fix application and data ‘bugs’. As the data is modified to pass tests it becomes more consistent with the software code (see diagram Data SDLC; DEV phase). However, is it really more ‘correct’?
Through the test phase the data continues to become more consistent with the developed code (see diagram Data SDLC; TEST phase). At the point where the users sign off, it is at its most consistent with the software code. Now the data and code function well enough together to give User Stakeholders the confidence to sign off on the production move.

Data Reliability SDLC
Hopefully, at this point the data is consistent or, better yet, production quality. If it is, then the signoff is much more legitimate. Preferably the new system has been run in parallel with the existing one, if this is a systems replacement project. However, if functionality has changed significantly, data comparison will become difficult if not impossible. This is a reconciliation discussion for a later time.
The glorious day has arrived and you have migrated to production (let’s address the actual migration in more detail at a later time). If you haven’t had a period of parallel, and maybe even if you have, the sad fact is, it’s downhill from here my friend…and not in a good way. Jesting aside, if real, timely production data hasn’t hit the system prior to production migration, hold on to your hat. You could be in for a bumpy ride.
Once the system reaches a level of stability you will find the data reliability settles into an ‘Operational’ mode (see diagram Operational Data). Often, we find that data reliability is at its worst in the few minutes or hours after the data is loaded (that’s when we find the most serious issues), or after some period of time when the data becomes out of date, increasingly updated, or otherwise modified beyond its original intention.

Data Reliability Operational
At the top of the data reliability curve are those beautiful days or weeks or months (gasp!) of full nights’ sleep where all the known issues have been fixed and time has not degraded the data reliability. These are the times for which we live.
Beware DML destruction! DML, or Data Manipulation Language, is often used to make those quick ‘fixes’ that insert, delete or update values, often without sufficient testing, audit or logs. DBAs will often, with just cause, protest such changes with gusto. They know it could cost them their next night’s sleep.
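As a rough illustration of what ‘sufficient audit’ might look like for a quick fix, the sketch below wraps a single-value update in a transaction and records the old value, new value and reason in an audit table. It uses Python’s built-in `sqlite3` module; the `accounts` and `dml_audit` tables and the `audited_update` helper are hypothetical names invented for this example, not part of any particular product.

```python
import sqlite3
from datetime import datetime, timezone


def audited_update(conn, table, column, new_value, key_column, key_value, reason):
    """Update one value and log the before/after in an audit table, atomically."""
    # Identifiers cannot be bound as parameters, so this sketch assumes table
    # and column names come from trusted code, never from user input.
    with conn:  # commits on success, rolls back if anything raises
        row = conn.execute(
            f"SELECT {column} FROM {table} WHERE {key_column} = ?", (key_value,)
        ).fetchone()
        if row is None:
            raise LookupError(f"no row in {table} with {key_column}={key_value!r}")
        conn.execute(
            f"UPDATE {table} SET {column} = ? WHERE {key_column} = ?",
            (new_value, key_value),
        )
        conn.execute(
            "INSERT INTO dml_audit (changed_at, table_name, column_name, row_key,"
            " old_value, new_value, reason) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), table, column,
             str(key_value), str(row[0]), str(new_value), reason),
        )


# Hypothetical in-memory schema, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE dml_audit (changed_at TEXT, table_name TEXT, column_name TEXT,"
    " row_key TEXT, old_value TEXT, new_value TEXT, reason TEXT)"
)
conn.execute("INSERT INTO accounts (id, status) VALUES (1, 'OPNE')")
conn.commit()

audited_update(conn, "accounts", "status", "OPEN", "id", 1, "typo in load file")
```

The point of the design is that the fix and its audit record succeed or fail together, so there is never a changed value without a trail explaining it.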
You will notice in the Operational Data diagram a SET A and a SET B. These are representative, but you will often find that after some time in production the data ‘acts’ a certain way. It has its own characteristics of data reliability, sometimes trending over time more steeply (SET A) or more linearly (SET B).
I’m convinced the data has emotions. Don’t rile it up. It might not be stable! Seriously, given the good news of Operational and SDLC Data, should we simply find another line of work? In fact, once you get used to ‘your’ data, you’ll begin to notice these patterns. You will be able to tell immediately from a report or load whether or not the data is off. It’s like a sixth sense for data.
I recall an ‘operational data god’ who would get a notification and, just from the file byte size contained in the subject, knew whether it was going to be a good day. He never had to open the message.
But wouldn’t it be better if we could use tools to reduce that burn-in/learning cycle? Wouldn’t it be nice to know the issues before the customer reported them? Wouldn’t it be wonderful for test data and scripts to be like production and vice versa? There are tools and techniques that will help, but that is a topic for later discussion.
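That file-size instinct can be approximated in a few lines. The sketch below flags a load whose byte size sits far from the recent norm. The `size_looks_off` function, the three-standard-deviation tolerance and the sample sizes are all assumptions made for illustration, not a description of what he actually did in his head.

```python
from statistics import mean, stdev


def size_looks_off(history_bytes, todays_bytes, tolerance=3.0):
    """Flag a load whose size is far from the recent norm.

    history_bytes: sizes of recent loads believed to be good (hypothetical input).
    tolerance: how many standard deviations count as 'off' (an arbitrary default).
    """
    if len(history_bytes) < 2:
        return False  # not enough history to judge
    centre = mean(history_bytes)
    spread = stdev(history_bytes)
    if spread == 0:
        return todays_bytes != centre
    return abs(todays_bytes - centre) > tolerance * spread


# Made-up sizes for the last five loads.
recent = [10_200, 10_450, 9_980, 10_310, 10_120]
normal_day = size_looks_off(recent, 10_300)
bad_day = size_looks_off(recent, 2_048)
```

A check this crude will miss plenty, but it catches the truncated or empty file before anyone opens a report.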